Large Text File Processing

Question

I am new to Revolution R and am trying to open a large CSV file (13 GB), a dataset from a Kaggle competition. R cannot open it, so I turned to Revolution R Enterprise. How can I read the CSV file on my system, convert it to XDF format, and load it into Revolution R Enterprise for further analysis?

My file path is “C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv”.

I tried something like this but got an error.

sampleDataDir <- rxGetOption("Kaggle") 

inputFile <- file.path("C:\\Users\\admin\\Desktop\\Kaggle\\dog_1_both_marked.csv", "dog_1_both_marked.csv")

outputFile <- file.path(tempdir(), "basicClaims.xdf") 

rxTextToXdf(inFile = inputFile, outFile = outputFile, overwrite = TRUE) 

rxGetInfo(data = outputFile, getVarInfo = TRUE, numRows = 100000) 

file.remove(outputFile)

 

Answer

R can read a large text file segment by segment and process the segments in parallel, but the code is complicated and performs poorly. R is designed for mathematical and statistical work and is not good at structured data processing. A better tool is SPL (Structured Process Language). The examples below show common structured computations in SPL; the field names (BIRTHDAY, GENDER, DEPT, SALARY) are placeholders for columns in your own file.

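1. Open a large text file with a cursor

     A
1    =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc()

Here @t reads the first line as column names and @c uses the comma as the separator, which a CSV file needs.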
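For readers who do not have SPL at hand, the idea behind a cursor can be sketched in Python: rows are pulled from the file in small batches, so memory use stays flat however large the file is. This is only an illustration of the concept, not SPL's implementation; the names iter_batches and count_rows and the default batch size are assumptions made for this sketch.

```python
import csv
from itertools import islice
from typing import Dict, Iterator, List


def iter_batches(path: str, batch_size: int = 10_000) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of up to batch_size rows; the first line supplies the column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls only the next batch_size rows from the file.
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch


def count_rows(path: str) -> int:
    """Count data rows without holding the whole file in memory."""
    return sum(len(batch) for batch in iter_batches(path))
```

Calling count_rows on the CSV path from the question would walk the whole 13 GB file while keeping at most one batch in memory at a time.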

2. Data query

     A
1    =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc()
2    =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F")
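The same query can be sketched as a streaming filter in Python, which may help show what the select step does on a cursor: each row is tested as it is read, and only matching rows are passed on. The function name select_rows is hypothetical, and the sketch assumes BIRTHDAY is stored as ISO text (for example 1985-03-17), which the original post does not state.

```python
import csv
from datetime import date
from typing import Dict, Iterator


def select_rows(path: str) -> Iterator[Dict[str, str]]:
    """Lazily yield rows with BIRTHDAY >= 1981-01-01 and GENDER == 'F'."""
    cutoff = date(1981, 1, 1)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumption: BIRTHDAY is ISO-formatted text (YYYY-MM-DD).
            born = date.fromisoformat(row["BIRTHDAY"])
            if born >= cutoff and row["GENDER"] == "F":
                yield row
```

Because select_rows is a generator, nothing is read until the caller iterates over it, so the result can be written out or aggregated row by row.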

3. Grouping & aggregation

     A
1    =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc()
2    =A1.groups(DEPT:dept;count(~):count,sum(SALARY):salary)
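As a rough Python counterpart, grouping over a large file can be done in one pass as long as the number of distinct groups is small enough to fit in memory: only the running count and sum per group are kept, never the rows themselves. The function name group_by_dept is an assumption for this sketch, as is the premise that SALARY holds plain numeric text.

```python
import csv
from collections import defaultdict
from typing import Dict, Tuple


def group_by_dept(path: str) -> Dict[str, Tuple[int, float]]:
    """Return {DEPT: (row count, total SALARY)} from a single pass over the file."""
    totals = defaultdict(lambda: [0, 0.0])
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            entry = totals[row["DEPT"]]
            entry[0] += 1                      # count of rows in this group
            entry[1] += float(row["SALARY"])   # running salary total
    # Sort by group key so the output order is stable.
    return {dept: (count, salary) for dept, (count, salary) in sorted(totals.items())}
```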

4. Sorting

     A
1    =file("C:\Users\admin\Desktop\Kaggle\dog_1_both_marked.csv").cursor@tc()
2    =A1.sortx(BIRTHDAY)
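Sorting a file that does not fit in memory is usually done as an external merge sort: sort manageable runs, write each run to a temporary file, then merge the runs. The Python sketch below shows that general technique; it is not a description of how sortx works internally. The names sort_csv, _write_run and _read_run and the default run size are hypothetical, and the key column is compared as text, which orders ISO dates correctly but not plain numbers.

```python
import csv
import heapq
import os
import tempfile
from itertools import islice
from typing import Iterator, List


def _write_run(rows: List[List[str]], key_index: int) -> str:
    """Sort one in-memory run and write it to a temporary file."""
    rows.sort(key=lambda r: r[key_index])
    fd, run_path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return run_path


def _read_run(run_path: str) -> Iterator[List[str]]:
    """Stream rows back from a sorted run file."""
    with open(run_path, newline="", encoding="utf-8") as handle:
        yield from csv.reader(handle)


def sort_csv(in_path: str, out_path: str, key: str, run_size: int = 100_000) -> None:
    """Write a copy of in_path to out_path, sorted by the text value of column key."""
    run_paths: List[str] = []
    with open(in_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        key_index = header.index(key)
        while True:
            rows = list(islice(reader, run_size))
            if not rows:
                break
            run_paths.append(_write_run(rows, key_index))
    try:
        with open(out_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            # heapq.merge keeps only one row per run in memory at a time.
            runs = [_read_run(p) for p in run_paths]
            writer.writerows(heapq.merge(*runs, key=lambda r: r[key_index]))
    finally:
        for p in run_paths:
            os.remove(p)
```

A call such as sort_csv(in_path, out_path, "BIRTHDAY") would mirror the SPL step above, with run_size tuned to the memory available.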

 

··· ···

For more examples, please refer to Text Files.