# Get Random Small Samples from a Huge Text File

Here are the first 5 records in text file huge.txt:

| f1 | f2 |
|---|---|
| yewhhgfifsbplrxankqazzewzkhfxjetiprfvyinchmdventatkry | lwxazkmczmpcluechdtfgwapgvyzfxqczcuvadkfqrcciptmpo |
| viqxbdjjzkdcytdnjiuexottvgdjkafhykbotjsupyuybvgycqhfsdlypuftbezga | mmoermrlbovwmfnxgctizucfccatwlvugnqvikhbgaqvamwbzqluwavgcjtonutairrafrpywtwtpocgltmfrxz |
| plhdyslghehlptlsczizhjbtcqwasvspjqyeifsnqagqovvdukxftsp | tlisnnguudbqgrupqpoqjfshldpuwjdkfeizhkfwsvmdspswusmclhqzzxaumvwrerbsl |
| bltnilcncwgnsyxeosdtytvpdbxuiwukdqpgvvbihoqvvmhogmffzpivuysbhgitfqxptyuofsukmz | ajojwbcfptahjetpnmkbsfrblubvvjxyestplybzpxxwsrppgteoreckkscrsu |
| … | … |

The whole text file is 200 GB in size. We want to draw a random sample of 10,000 records from it.

1. Write script sample.dfx in esProc:

|   | A |
|---|---|
| 1 | =file("huge.txt") |
| 2 | =A1.cursor@t() |
| 3 | =A2.fetch(100).(len(f1)+len(f2)).avg() |
| 4 | =A1.size()/A3 |
| 5 | =10000.(rand(A4)) |
| 6 | =A5.(A1.cursor@t(;~:A4).fetch(1)).conj() |

A1 Open the text file huge.txt.

A2 Create a cursor on the file; the @t option reads the first line as column names.

A3 Fetch the first 100 records and compute their average length, len(f1)+len(f2), as an estimate of the number of bytes per record.

A4 Estimate the total number of records by dividing the file size in bytes by the average record length from A3.

A5 Generate 10,000 random numbers, each below the estimated record count in A4. Each number identifies one segment of the file.

A6 Treat the text file as n segments, where n is the estimated record count in A4. For each of the 10,000 random numbers in A5, open a cursor on that segment and fetch its first record, then concatenate the fetched records into one result set.
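The steps A1 to A6 can also be sketched outside esProc. The Python below is an illustrative analogue, not the esProc implementation: the function names (`estimate_record_count`, `first_record_in_segment`, `sample_records`) are hypothetical, and it assumes a tab-separated file with one header line and that a segment yields only records whose first byte lies inside it. Unlike A3, it measures whole lines, including the tab and the line break.

```python
import os
import random


def estimate_record_count(path, probe=100):
    """Estimate the record count from the first `probe` data lines (A3, A4)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header_len = len(f.readline())  # header line with the column names
        sizes = [len(line) for _, line in zip(range(probe), f)]
    if not sizes:
        return 0, header_len
    avg = sum(sizes) / len(sizes)
    return int((size - header_len) / avg), header_len


def first_record_in_segment(f, start, end):
    """Return the first line that begins in bytes [start, end), or None."""
    if start == 0:
        f.seek(0)
    else:
        f.seek(start - 1)
        f.readline()  # skip the remainder of the line that straddles `start`
    if f.tell() >= end:
        return None  # no record begins inside this segment
    line = f.readline()
    return line.rstrip(b"\r\n") if line else None


def sample_records(path, k, seed=None):
    """Pick k random segments and take the first record of each (A5, A6)."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    n, header_len = estimate_record_count(path)
    if n == 0:
        return []
    data = size - header_len
    samples = []
    with open(path, "rb") as f:
        for _ in range(k):
            i = rng.randrange(n)  # random segment number, 0 <= i < n
            start = header_len + data * i // n
            end = header_len + data * (i + 1) // n
            rec = first_record_in_segment(f, start, end)
            if rec is not None:
                samples.append(rec.decode("utf-8").split("\t"))
    return samples
```

Calling `sample_records("huge.txt", 10000)` would perform 10,000 seeks and short reads instead of scanning the whole 200 GB file, which is the point of the segment-based approach.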

2. Execute the script. A6 returns the final result.

The solution is relatively simple, and the sampling it produces is rough. The total number of segments is only an estimate: if it is greater than the actual number of records, some segments contain no record to fetch, so the final number of samples is probably slightly fewer than 10,000.
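The size of the shortfall can be reasoned about with a small sketch. The Python function below (`nonempty_fraction` is a hypothetical name) assumes that a segment yields a record only if some record starts inside it, and that segments are equal-width byte ranges. It returns the fraction of segments that can yield a record; multiplying that fraction by 10,000 gives the expected sample count.

```python
def nonempty_fraction(record_sizes, n_segments):
    """Fraction of equal-width byte segments in which at least one record starts."""
    total = sum(record_sizes)
    hit = set()
    offset = 0
    for size in record_sizes:
        # Index of the segment that contains this record's first byte.
        hit.add(offset * n_segments // total)
        offset += size
    return len(hit) / n_segments


# 1,000 records of 22 bytes each (illustrative numbers only).
sizes = [22] * 1000
exact = nonempty_fraction(sizes, 1000)  # estimate equals the true count
over = nonempty_fraction(sizes, 1250)   # estimate is 25% too high
```

In this illustrative setup `exact` is 1.0 and `over` is 0.8, so an estimate that is 25% too high would return about 8,000 records instead of 10,000. The closer the estimate in A4 is to the true record count, the smaller the shortfall.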