Puzzles & Games with esProc: WordCount

 

  Writing a word count program is one of the most common first exercises for getting familiar with a distributed system such as Hadoop. Here we’ll show how to perform a word count with esProc.

Let’s begin with a single thread.

  The D:\files\novel directory contains novels in text format. Task: find the most frequently used words in the novels.


  esProc handles the task in a long query:


     A
1    =directory@p("D:/files/novel").(file(~).read().words().groups(lower(~):Word;count(~):Count)).merge(Word).groups@o(Word;sum(Count):Count).sort@z(Count)

  It’s simple, isn’t it? Here’s A1’s result:


  To make it easier to understand, we split the long query into several steps:


     A                                B                                                   C
1    =directory@p("D:/files/novel")   []                                                  =now()
2    for A1                           =file(A2).read().words()
3                                     =B2.groups(lower(~):Word;count(~):Count)
4                                     >B1=B1|[B3]
5    =B1.merge(Word)                  =A5.groups@o(Word;sum(Count):Count).sort@z(Count)   =interval@ms(C1,now())


  A1 gets text files in the specified directory:


  Lines 2 to 4 count the words in each file. B2 reads each text file and splits it into individual words:


  B3 counts the occurrences of each word in the current file. It converts words to lowercase so that the count ignores case, and it sorts the result set in alphabetical order:


  B4 appends each file’s result to B1, so once the loop over all files has finished, B1 holds one result set per file. Below is B1’s result:


  A5 merges B1’s per-file result sets, each already ordered by Word, into a single sequence ordered by Word:


  B5 sums A5’s Count by word and sorts the records by Count in descending order. The final result is the same as that of the single query:


  C1 records the start time and C5 computes the elapsed time in milliseconds, which tells us how long the count takes:


  It’s fast!
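  For readers who want to check the logic outside esProc, the same pipeline can be sketched in Python. This is only an illustration under stated assumptions, not esProc’s implementation: the function names, the letters-only word pattern and the two in-memory sample texts are all hypothetical stand-ins for the novels on disk.

```python
import heapq
import re
from collections import Counter
from itertools import groupby
from operator import itemgetter


def count_words(text):
    """Per-file step (like B2/B3): split into words, lowercase, count, sort by word.

    Assumption: a "word" is a run of ASCII letters.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(Counter(words).items())


def merge_counts(per_file_counts):
    """Merge step (like A5/B5): merge word-ordered lists, sum adjacent equal
    words, then sort by count with the highest first."""
    merged = heapq.merge(*per_file_counts, key=itemgetter(0))
    totals = [(word, sum(n for _, n in group))
              for word, group in groupby(merged, key=itemgetter(0))]
    return sorted(totals, key=itemgetter(1), reverse=True)


# Hypothetical in-memory "files" standing in for the novels.
texts = ["The cat saw the dog.", "A dog saw THE bird."]
result = merge_counts([count_words(t) for t in texts])
print(result[0])  # ('the', 3)
```

  Because every per-file list is already sorted by word, the merge only has to compare the heads of the lists, and equal words end up next to each other, so a single pass is enough to sum them.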

Next, let’s try multithreading.

  The esProc script for parallel processing:


     A                                B                                                   C
1    =directory@p("D:/files/novel")                                                       =now()
2    fork A1                          =file(A2).read().words()
3                                     =B2.groups(lower(~):Word;count(~):Count)
4    =A2.merge(Word)                  =A4.groups@o(Word;sum(Count):Count).sort@z(Count)   =interval@ms(C1,now())

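  The most obvious difference is that A2’s for statement is replaced by a fork statement. The results of the threads are collected in A2, so the B1 accumulator is no longer needed and A4 merges A2 directly. C4 computes the time it takes to perform the count: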


  It’s much faster. My computer is dual-core, so the speedup is satisfactory.

  The fork statement automatically runs the iterations in multiple threads, saving you the trouble of managing threads.
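  As a rough analogy only, the structure of the fork version looks like this in Python, with a thread pool standing in for fork. The names and sample texts are hypothetical, and the sketch says nothing about how esProc schedules its threads. Note also that CPython threads will not speed up CPU-bound counting the way esProc’s threads do; the point here is the shape of the code, not the timing.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """Loop body: count the lowercase words of one text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def parallel_word_count(texts, workers=2):
    """Run count_words on each text in a thread pool and combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, texts))  # one Counter per text
    total = Counter()
    for partial in partials:
        total.update(partial)  # add this text's counts to the running total
    return total.most_common()  # sorted by count, highest first


# Hypothetical sample texts standing in for the novels.
texts = ["to be or not to be", "To err is human"]
print(parallel_word_count(texts)[0])  # ('to', 3)
```

  The pool hands back one partial result per input, much as A2 collects one result per file, and no thread is created or joined by hand.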

Let’s move on to solving the problem with cluster computing.

  Here I simulate a server cluster with several processes on one computer. In the esProc\bin folder under esProc’s installation directory, we can run esprocs.exe to start or configure a server:


  Before clicking the Start button to start a server, we click the Config button to configure it, for example to set the servers’ IP addresses and ports on the Unit tab, as shown below:


  After the configuration is done, we start the server from the Unit Server window. Run esprocs.exe again to start another server, and repeat until all servers are running. The three servers correspond to the configured IP addresses and ports in order. Now the cluster is ready.

  Since we simulate the cluster on a single machine, the parallel servers share the same file paths. With real remote servers the paths would differ.

     A                                                                B                                                                   C
1    [192.168.10.229:4001,192.168.10.229:4004,192.168.10.229:4007]    [D:/files/novel1,D:/files/novel2,D:/files/novel3,D:/files/novel4]
2    fork B1;A1                                                       =directory@p(A2)
3                                                                     fork B2                                                             =file(B3).read().words()
4                                                                                                                                         =C3.groups(lower(~):Word;count(~):Count)
5                                                                     return B3.merge(Word)
6    =A2.merge(Word)                                                  =A6.groups@o(Word;sum(Count):Count).sort@z(Count)


  The program takes four directory paths as parameters and runs four subtasks, each of which counts the frequency of every word in the files under one path. Add the server addresses after fork, and esProc assigns the subtasks to the servers and aggregates the results. Coders don’t need to think about these details. Finally, B6 holds the final result.
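  The two-level structure of this script (one subtask per directory, one inner task per file, partial results merged at each level) can be sketched in Python as follows. This is a single-process illustration under loud assumptions: nested thread pools stand in for the unit servers, nothing is sent over a network, and the directory names, sample texts and function names are hypothetical. It does not show how esProc actually distributes the subtasks.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_file(text):
    """Innermost step: word counts for one file's text (whitespace-split)."""
    return Counter(text.lower().split())


def run_subtask(file_texts):
    """One subtask (one directory): count every file, return one partial result."""
    partial = Counter()
    with ThreadPoolExecutor() as pool:
        for counts in pool.map(count_file, file_texts):
            partial.update(counts)
    return partial


def run_job(directories, max_nodes=3):
    """Top level: hand each directory to a worker, then combine the partials.

    max_nodes mirrors the three simulated servers; it is only a pool size here.
    """
    with ThreadPoolExecutor(max_workers=max_nodes) as pool:
        partials = list(pool.map(run_subtask, directories.values()))
    return sum(partials, Counter()).most_common()


# Hypothetical stand-ins for the four directories and the files in them.
directories = {
    "novel1": ["call me ishmael", "me too"],
    "novel2": ["it was the best of times"],
    "novel3": ["it was the worst of times"],
    "novel4": ["me again"],
}
print(run_job(directories)[0])  # ('me', 3)
```

  Each subtask returns a single combined result rather than one result per file, so the top level only has to combine four partial results, one per directory.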


  We can view the task distribution and execution in the server windows:


  It’s simple, easy, efficient and cool. We have this done while somebody else is still busy setting up a Hadoop cluster.