Get Random Small Samples from a Huge Text File

Here are the first 5 records of the text file huge.txt:

f1	f2
yewhhgfifsbplrxankqazzewzkhfxjetiprfvyinchmdventatkry	lwxazkmczmpcluechdtfgwapgvyzfxqczcuvadkfqrcciptmpo
viqxbdjjzkdcytdnjiuexottvgdjkafhykbotjsupyuybvgycqhfsdlypuftbezga	mmoermrlbovwmfnxgctizucfccatwlvugnqvikhbgaqvamwbzqluwavgcjtonutairrafrpywtwtpocgltmfrxz
plhdyslghehlptlsczizhjbtcqwasvspjqyeifsnqagqovvdukxftsp	tlisnnguudbqgrupqpoqjfshldpuwjdkfeizhkfwsvmdspswusmclhqzzxaumvwrerbsl
bltnilcncwgnsyxeosdtytvpdbxuiwukdqpgvvbihoqvvmhogmffzpivuysbhgitfqxptyuofsukmz	ajojwbcfptahjetpnmkbsfrblubvvjxyestplybzpxxwsrppgteoreckkscrsu

The whole text file is 200 GB in size. We want to draw a random sample of 10,000 records from it.

It’s effortless to do this with esProc SPL. Download esProc installation package HERE.

1. Write SPL script sample.dfx in esProc:

	A
1	=file("huge.txt")
2	=A1.cursor@t()
3	=A2.fetch(100).(len(f1)+len(f2)).avg()
4	=A1.size()/A3
5	=10000.(rand(A4))
6	=A5.(A1.cursor@t(;~:A4).fetch(1)).conj()

A1 Open the text file huge.txt.

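A2 Create a cursor over the file; the @t option reads the first line as column names.

A3 Fetch the first 100 records and calculate their average length, using the combined length of f1 and f2 as an approximation of each record's size in bytes.

A4 Estimate the number of records in the file by dividing the file size in bytes by the average record length from A3.

A5 Generate 10,000 random numbers, each less than the estimated record count in A4. They serve as segment numbers in A6.

A6 Treat the text file as n segments, where n is the estimated record count in A4. For each of the 10,000 random segment numbers, open a cursor on that segment and fetch its first record, then concatenate the results.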
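For readers who want to see the same idea outside SPL, here is a minimal Python sketch of the technique, not a translation of the esProc internals. It assumes a tab-separated file with one header line, as in huge.txt; the function names (estimate_record_count, first_record_in_segment, sample_records) are hypothetical, and segments are numbered from 0 here.

```python
import os
import random


def estimate_record_count(path, probe=100):
    """Estimate the record count as file size / average length of the first `probe` records."""
    lengths = []
    with open(path, "rb") as f:
        f.readline()  # skip the header line (f1, f2)
        for _ in range(probe):
            line = f.readline()
            if not line:
                break
            lengths.append(len(line))
    if not lengths:
        return 0
    avg = sum(lengths) / len(lengths)
    return max(1, int(os.path.getsize(path) / avg))


def first_record_in_segment(f, size, k, n):
    """Return the first line that starts inside byte segment k of n, or None if none does."""
    start = size * k // n
    end = size * (k + 1) // n
    if start == 0:
        f.seek(0)
        f.readline()  # the header starts at byte 0; skip it
    else:
        f.seek(start - 1)
        f.readline()  # move to the first line start at or after `start`
    if f.tell() >= end:
        return None  # no record starts in this segment
    line = f.readline()
    return line.rstrip(b"\r\n") if line else None


def sample_records(path, k=10_000, seed=None):
    """Pick k random segments and collect the first record of each non-empty one."""
    rng = random.Random(seed)
    n = estimate_record_count(path)
    if n == 0:
        return []
    size = os.path.getsize(path)
    out = []
    with open(path, "rb") as f:
        for _ in range(k):
            rec = first_record_in_segment(f, size, rng.randrange(n), n)
            if rec is not None:
                out.append(rec.decode().split("\t"))
    return out
```

Each pick costs one seek and at most two line reads, so the file is never scanned in full. Because the segment numbers are drawn independently, the same record can be picked more than once.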

2. Execute the script. A6 returns the final result.

This solution is simple, but the sample it produces is only approximate. The total number of segments is an estimate; if it is greater than the actual number of records, some segments contain no record at all, so the final number of samples will probably be slightly less than 10,000.
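To see why an overestimated segment count shrinks the sample, the following Python sketch counts how many of n equal byte segments contain the start of at least one line. It is an illustration under simplifying assumptions (every line has a known byte length, and a byte at offset b belongs to segment b*n // size); nonempty_fraction is a hypothetical helper, not part of esProc.

```python
def nonempty_fraction(line_lengths, n):
    """Fraction of n equal byte segments in which at least one line starts."""
    size = sum(line_lengths)
    hit = set()
    pos = 0
    for length in line_lengths:
        hit.add(pos * n // size)  # segment that holds this line's first byte
        pos += length
    return len(hit) / n


# 1,000 lines of 50 bytes each
exact = nonempty_fraction([50] * 1000, 1000)  # estimate equals the true count
over = nonempty_fraction([50] * 1000, 1250)   # estimate is 25% too high
print(exact, over)  # 1.0 0.8
```

In this toy case, an estimate that is 25% too high leaves one segment in five empty, so 10,000 random picks would return about 8,000 records.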

Read How to Call an SPL Script in Java to learn about the integration of an esProc SPL script with a Java program.