Non-equi Group Operations

The previous article discusses the essence of the grouping operations, which divides a set into multiple subsets according to a certain rule. It focuses on the way of getting a true grouping operation restored, without studying the grouping rules. Those in the examples are defined by one ore multiple fields (expressions). That is the SQL way.

It is called the equi-grouping.

Mathematically, the equi-grouping defines an equivalence relation over a set, where members (records) whose grouping fields (expressions) have the same values are considered to be equivalent.

An equivalence relation has the following properties:

1) Interchangeability. If a=b, then b=a;

2) Transitivity. If a=b, b=c, then a=c;

3) Exclusivity. For any a and b, only one of the equations a=b and a!=b holds.

An equivalence relation on a set partitions a set into non-overlapping subsets – which is a complete partition of a set. In each subset, members are considered to be equivalent.

A partition of a set has two characters:

1) The subsets are non-empty;

2) Every member is included in one and only one of the subsets.

An equi-grouping operation exactly meets the definition of an equivalence relation. And its result set is definitely a set of partitions.

Now let’s think about these. Does there exist a non-equi grouping and incomplete partition? Can a partition of a set be generated in a way that isn’t equi-grouping? Do they have values in the business practices? Yes, Definitely.

To count the number of the male and female employees, for example, we can use the following code:

SELECT gender,COUNT(*) FROM employee GROUP BY gender

But if the employees in a company are all male or female, the result set will have only one record, which isn’t what we expect.

We can design a different grouping method to solve the issue. Define a base set, compare its elements with an attribute (a specified field or an expression) of the members in the to-be-grouped set, and put members that make the comparison result true into the same subset. The number of the subsets is equivalent to that of the members in the base set. This type of grouping is called alignment grouping.

To count the male and female employees using the alignment grouping:

a=[Male,Female] // Base set

g=employee.align(a,gender) // Use the align function to implement the alignment grouping for splitting the original set

g.new(a(#),~.len()) // Calculate aggregate values based on the grouping subsets

Similar examples are common in the real-world statistical practices, like the aggregation by regions or by departments. They can all be handled with alignment grouping by defining a base set. Often members of the result set are required to be arranged according to the order of the members of the base set. An equi-grouping can’t ensure a specified order. A second sorting is thus needed (the sorting is also based on the specified base set, which can’t be obtained directly from the original set).

The alignment grouping may produce an empty subset, and won’t make sure that each member belongs to a subset (like the occasion where certain members aren’t included in the base set). But with alignment grouping, each member will appear in only one subset.

A general case of the alignment grouping is the enumerated grouping.

An enumerated grouping specifies a set of criteria and uses each of the members of the to-be-grouped set as the parameter to compute them while putting a member to a corresponding subset if the result is true. The subsets of the result set and the criteria correspond to one another in order.

To count the employees by age groups, for example:

a=[?<=30,?<=40,?>40] // Use the question mark (?) to represent a to-be-introduced parameter

g=employee.enum(a,age) // Use the enum function to implement the enumerated grouping for splitting the original set

….

There are not a few similar examples in the real-world businesses.

Enumerated grouping and alignment grouping are similar in that both need to first define a base set. Technically the alignment grouping is a special category of the enumerated grouping. Where they differ is that the enumerated grouping may generate a result set where a member belongs to more than one subset, which is called duplicable grouping.

a=[?<=30,?>20 && ?<=40,?>50] // the criteria overlap

g=employee.enum(a,age)

The duplicable grouping is relatively rare in the practical situations, but the knowledge helps better grasp the essence of the group operations.

It looks like that the alignment grouping and enumerated grouping differ much from the SQL GROUP BY clause. But once you have a true understanding about the essence of the group operation, you will know that they are the same thing - that is, splitting a set into multiple subsets in different ways.

There are also grouping rules that are not completely based on the attributes of the members of a data set. But likewise, they work to break a set apart into multiple subsets. We’ll discuss them in subsequent articles.

Two more questions: Can SQL handle all situations by providing only the equi-group operations? What can we do in dealing with computing scenarios involving alignment grouping and enumerated grouping in SQL?

Actually the computing capability of SQL is complete. Both types of non-equi-grouping computation can be converted into the equi-grouping computation. The process, however, isn’t simple.

To handle an alignment grouping scenario, we can first perform a LEFT JOIN over the base set and the to-be-grouped set and then a GROUP BY over the result set. Make sure you use a LEFT JOIN because a JOIN may lose the empty subset(s) and a FULL JOIN will generate members that don’t match the base set. The method applies to an enumerated grouping, but the statement will be more complicated. Rules for a JOIN need to be designed according to the specified enumerated conditions, and thus there is no standard syntax for handling the enumerated grouping scenarios.

Note: We discuss the handling of group operations based on the assumptions that the to-be-grouped set is already in place and that its members can be directly accessed anytime.