Get Top 3 from Each Hive Group

Apache Hive uses window function and subqueries to handle top N problems. It’s difficult to write the program:

SELECT *
FROM (
    SELECT g, ROW_NUMBER() OVER(PARTITION BY g
    ORDER BY x) group_id RN FROM A
)
WHERE group_id <= 3
ORDER BY g ASC, group_id ASC

 

It’s simple and easy to do this in SPL (Structured Process Language). Only a one-liner is sufficient:

A.group(g;~.top(-3;x)).conj(#2)

 

* To get rows where a specified field value is greater than the median from each group;

* To get the maximum growth rate in each group;

* To get groups where values of a specified field rises consecutively;

SPL is good at handling these types of calculations over subsets or detailed data in a group. More explanations and examples can be found in Getting Top N Problem Optimization in SPL, Intra-group Calculation Optimization in SPL and Inter-row Calculation Optimization in SPL.

 

esProc is a desktop programming tool. It’s easy to install, and it’s convenient to debug code and view data in it:

Download esProc

esProc is powerfully independent yet integration-friendly. You can easily embed an SPL script into any Java program. Refer to How to Call an SPL Script in Java to learn details.