
blackduckie RaqForum 28 No.
1 Reply • 291 View • 3 Years ago

Get Random Small Samples from a Huge Text File

Here are the first 5 records in text file huge.txt:

f1	f2
yewhhgfifsbplrxankqazzewzkhfxjetiprfvyinchmdventatkry	lwxazkmczmpcluechdtfgwapgvyzfxqczcuvadkfqrcciptmpo
viqxbdjjzkdcytdnjiuexottvgdjkafhykbotjsupyuybvgycqhfsdlypuftbezga	mmoermrlbovwmfnxgctizucfccatwlvugnqvikhbgaqvamwbzqluwavgcjtonutairrafrpywtwtpocgltmfrxz
plhdyslghehlptlsczizhjbtcqwasvspjqyeifsnqagqovvdukxftsp	tlisnnguudbqgrupqpoqjfshldpuwjdkfeizhkfwsvmdspswusmclhqzzxaumvwrerbsl
bltnilcncwgnsyxeosdtytvpdbxuiwukdqpgvvbihoqvvmhogmffzpivuysbhgitfqxptyuofsukmz	ajojwbcfptahjetpnmkbsfrblubvvjxyestplybzpxxwsrppgteoreckkscrsu
…	…

The whole text file is 200 GB in size. We want to draw a random sample of 10,000 records from it.

It’s effortless to do this with esProc SPL. Download esProc installation package HERE.

1. Write SPL script sample.dfx in esProc:

	A
1	=file("huge.txt")
2	=A1.cursor@t()
3	=A2.fetch(100).(len(f1)+len(f2)).avg()
4	=A1.size()/A3
5	=10000.(rand(A4))
6	=A5.(A1.cursor@t(;~:A4).fetch(1)).conj()

A1 Open the text file huge.txt.

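A2 Create a cursor over the file; @t reads the first line as column names.

A3 Fetch the first 100 records and calculate their average length, len(f1)+len(f2), which serves as an approximation of the number of bytes per record.

A4 Divide the file size by the average record length to estimate the number of records in the file.

A5 Generate 10,000 random numbers bounded by the estimated record count in A4. Each number identifies one segment of the file.

A6 Divide the text file into n segments, where n is the estimated record count in A4. For each random number in A5, open a cursor on that segment and fetch its first row, then concatenate the rows into one result set.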
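For readers who want to see the idea outside esProc, steps A1 to A6 can be sketched in Python. This is only an analogue under stated assumptions, not a description of how SPL segments a file internally: the function names (`estimate_record_count`, `first_row_of_segment`, `sample_records`) are hypothetical, segments are taken to be equal byte ranges, a record is assigned to the segment in which its first byte falls, and the average length is measured in raw bytes including the tab and line break.

```python
import os
import random


def estimate_record_count(f, size, probe=100):
    """Estimate the record count from the mean byte length of the first `probe` rows."""
    f.seek(0)
    f.readline()  # skip the header line (f1, f2)
    lengths = []
    for _ in range(probe):
        line = f.readline()
        if not line:
            break
        lengths.append(len(line))
    if not lengths:
        return 0
    # file size / average record length, as in A4
    return max(1, int(size * len(lengths) / sum(lengths)))


def first_row_of_segment(f, size, seg, n_segments):
    """Return the first record that starts inside segment `seg` (0-based), or None."""
    start = size * seg // n_segments
    end = size * (seg + 1) // n_segments
    if start == 0:
        f.seek(0)
        f.readline()  # the header is not a record
    else:
        f.seek(start - 1)
        f.readline()  # advance to the first line start at or after `start`
    if f.tell() >= end:
        return None  # no record begins in this segment
    line = f.readline()
    return line or None


def sample_records(path, k=10000, seed=None):
    """Draw up to `k` records by picking random segments (assumption: UTF-8, tab-separated)."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    rows = []
    with open(path, "rb") as f:
        n = estimate_record_count(f, size)
        if n == 0:
            return rows
        for _ in range(k):
            line = first_row_of_segment(f, size, rng.randrange(n), n)
            if line is not None:
                rows.append(line.rstrip(b"\r\n").decode("utf-8").split("\t"))
    return rows
```

Calling `sample_records("huge.txt")` would return a list of `[f1, f2]` pairs. Each draw costs one seek and at most two line reads, so the file is never scanned in full. In this sketch the same segment can be drawn more than once, so duplicates are possible.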

2. Execute the script. A6 holds the final result.

This is a relatively simple solution that produces a rough sample. The total number of segments is only an estimate. If it is greater than the actual number of records, some of the chosen segments may yield no record, so the final sample will probably contain slightly fewer than 10,000 records.
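The shortfall can be illustrated with a small Python sketch. It makes the same assumptions as before (equal byte-range segments, a record belongs to the segment where it starts), ignores the header line, and uses a hypothetical helper `nonempty_segments` plus made-up record lengths purely for illustration.

```python
import bisect


def nonempty_segments(line_lengths, n_segments):
    """Count the segments in which at least one record begins."""
    starts, pos = [], 0
    for length in line_lengths:
        starts.append(pos)  # byte offset where this record begins
        pos += length
    size = pos
    count = 0
    for seg in range(n_segments):
        lo = size * seg // n_segments
        hi = size * (seg + 1) // n_segments
        i = bisect.bisect_left(starts, lo)  # first record starting at or after lo
        if i < len(starts) and starts[i] < hi:
            count += 1
    return count


# Illustrative data: the first 100 records are a little shorter than the rest,
# so the average taken from them overestimates the record count.
lengths = [95] * 100 + [100] * 900      # 1,000 records, 99,500 bytes in total
estimated = int(sum(lengths) / 95)      # estimate based on the first 100 records
hits = nonempty_segments(lengths, estimated)
print(estimated, hits, hits / estimated)
```

Because there are only 1,000 records, at most 1,000 of the estimated segments can contain the start of a record. Every random draw that lands in one of the remaining segments returns nothing, which is why the sample comes up short.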

Read How to Call an SPL Script in Java to learn about the integration of an esProc SPL script with a Java program.

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/ydhVnFH9

Youtube 👉 https://www.youtube.com/@esProc_SPL
