Perform a Huge Number of String Comparisons with esProc
【Question】
There are 1001 numeric strings. Their lengths vary but all fall within a range of 0~9. For each string, I need to find among the other 1000 strings those having the most common numbers and those having the most different numbers.
By this I mean if both string 1 and string 1001 contain 1, they have one common number; if only string 1 contains 1 but string 1001 doesn’t, they have one different number.
My requirement is:
To achieve the above algorithm with the minimum steps or within the shortest time in MongoDB since the numeric strings is updated every 3 minutes.
My problem:
If the number of the strings doubles to 2001 or redoubles to, say 7001, can I use the same algorithm to achieve my goal?
【Answer】
Comparing each string to the rest of 1000 strings or even more can be a terrible work. That is beyond MongDB’s ability. To do this in the NoSQL database we have to retrieve data out to handle it. It’s inconvenient to hardcode the algorithm in Java. So here we try to do this with esProc. It’s much easier and simpler. Below is the script:
A |
B |
C |
D |
|
1 |
=connect(“test”) |
=A1.query@x(“select str from x”) |
||
2 |
=create(str,md,ms) |
|||
3 |
=B1.(~.str.split()) |
>debug(now()) |
||
4 |
for A3 |
>diff=same=0 |
||
5 |
for A3 |
if B5 != A4 |
=A4\B5 |
|
6 |
>diff=[D5.len(),diff].max() |
|||
7 |
>same=[A4.len()-diff,same].max() |
|||
8 |
=A2.record([A4.concat(),diff,same]) |
|||
9 |
>debug(now()) |
A1,B1: Retrieve the source data out and store it in B1’s table sequence.
A2: Create the result table sequence.
A3: Split each numeric string in B1’s source table sequence into an array of characters and return a sequence of arrays.
A4: The outer loop: get the first member in A3 to compare with every one of the rest of members.
B4: The initial value of member 1’s variable is 0; record the common members count and different members count during the loop.
B5: The inner loop: skip member 1 to get the next member to compare it with the previous member character by character.;
D5,D6: Get the different members count and always store the largest value.
D7: Calculate the difference between A4 and dif, which is the common members count (same) and always store the largest value. The inner loop is over.
B8: Record diff value and same value in A2’s table sequence. The outer loop is over.
With the string by string comparison, the performance decreases as the string count increases if the logic remains unchanged. According to B3 and A9, it takes between 80 and 110 seconds to complete the comparison of 1001 strings (result of multiple executions). When the string count reaches to 2000, the execution time will be over 6 minutes.
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/2bkGwqTj
Youtube 👉 https://www.youtube.com/@esProc_SPL