Revisiting the Notion of a Group Operation

Group operations are common in SQL queries. Yet not everyone can fully comprehend their essence.

A group operation breaks down a set into multiple subsets according to a specified rule. This means its return value should be a set whose members are also sets. In most cases the aggregate values of the member sets (we call them grouping subsets), instead of these subsets themselves, are desired, making an aggregate operation a persistent companion of a group operation.

That is how SQL does. When a SQL GROUP BY clause is used with the SELECT statement, the latter must use an aggregate expression besides the grouping field(s). Another reason is that SQL is unable to return a data set consisting of set members due to its lack of the explicit set data type, thereby having no choice but to perform an aggregation. Deeply influenced by SQL, the industry has become accustomed to a group operation followed by an aggregate operation, ignoring the fact that the two operations are independent of one another.

In real-world practice, there are still some scenarios where the subsets instead of the aggregate values are more desirable.

To find the employees who were born on the same date, for example. The simple method is to group all employees by birthday, find the grouping sets that contain more than one member, and union them. In this case, getting the grouping subsets takes priority over obtaining the aggregate values (the number of members in each grouping subset).

SQL expresses the algorithm in a roundabout way by using a subquery and traversing the original data set twice.

SELECT * FROM employee WHERE birthday IN

(SELECT birthday FROM employee GROUP BY birthday HAVING COUNT(*)>1 )

Notice: Here we assume that the birthday field contains the birth dates. Normally a birthday doesn’t include the year, but the birthday field in a data table has the year part. So we need to convert the values of birthday field into month & day before performing the group operation and the conditional filtering. But it is hard to phrase a SQL IN clause involving two members because of the language’s incomplete set-orientation. The birthday field needs to be rewritten as a separate expression like month(birthday()*100+day(birthday)) to be able to be judged by the IN clause. The code is complicated.

A completely set-oriented syntax, however, can retain the grouping subsets, each of which is made up of members in the original set. But this needs the support of discreteness feature. By restoring the group operation and aggregate operation to two separate steps, the above computation can be expressed more concisely:

employee.group(month(birthday),day(birthday)).select(~.len()>1).conj()

In this expression, the sign “~” is used to represent the current member (this is introduced in a previous article discussing the traversal syntax), which is a grouping subset being traversed.

The expression groups the original set by month and day in the birthday field, gets the subsets that contains more than one member, and unions them. A second traversal is needed to perform the filtering. But since there is only the count operation and won’t involve a comparison that SQL needs, the overall performance will be much higher.

Even if only the aggregate values are desired, we may use the grouping subsets to compute different types of aggregate value and thus need to keep them. Discarding them after one aggregate operation means a re-grouping when needed. But performing a group operation is time-costly. Generally the HASH method is used to do it. It’s a complicated process to calculate and compare HASH values, compared with a simple traversal operation. Some badly optimized algorithms even need to sort the data before grouping it (which is the practice of not a few reporting tools), decreasing the performance by an order of magnitude.

To count the employees in each department and calculate the average age in those departments where there are more than 10 employees, for example. To do it in SQL, two statements are needed because a HAVING clause is required to achieve the second computing goal:

SELECT department, COUNT(*) FROM employee GROUP BY department

SELECT department,AVERAGE(age) FROM employee GROUP BY department HAVING COUNT(*)>=10

Here two group operations are needed.

If the grouping subsets can be retained, one group operation will suffice:

g=employee.group(department)

g.new(~.department,~.len())

g.select(~.len()>=10).new(~.department,~.avg(age))

There is another possibility, that is, we want only one type of aggregate value but the computation is complex. A block of code, instead of a simple SUM/COUNT operation, is needed. On those occasions, the grouping subsets are useful. But SQL is unable to keep them. There will be more examples to illustrate this in later articles.

The result of a group operation is a set whose members are also sets. This means the set and each subset can be further grouped.

g1=employee.group(year(birthday)) // Group by the birth year

g2=g1.group(year(birthday)%100\10) // Group all the subsets by decade

g3=g1.(~.group(month(birthday)) // Group each subset by the birth month

Both the second and the third steps return a set whose members are sets and where each subset also consists of sets. A three- or more-layered set is rare in the real-world practices, but we can use it to better understand the group operations and appreciate the set-based thinking.

SQL specifically designs the HAVING keyword to perform a filtering over the result set of a group operation. Many beginners, however, can’t grasp the essence of the clause and thus misapply it. In fact, the definition of HAVING clause is just like the WHERE clause. It is invented because SQL can’t retain the grouping subsets but the grouping and the aggregation needs to be written in a same query and, at the same time, to be differentiated from the WHERE clause. The clause would be unnecessary if the grouping subsets can be retained for a step-by-step computation.

Note: We discuss the handling of group operations based on the assumptions that the to-be-grouped set is already in place and that its members can be directly accessed anytime.