Order-based Group Operations (II)

In the previous article, we look at the significance of the order of members in a to-be-grouped set. Speaking of the orderliness of sets, then conversely, does the order of members in a result set has significance in real-life computations?

Yes, though not as important as the order of members in the original data set.

There are two aspects we need to consider about the orderliness of the result set of a group operation. One is in which order the subsets are returned, the other is the order of members in each grouping subset.

For the equi-grouping operation over an ordered set, it’s appropriate that the default order of members in a grouping subset should be consistent with the order of their first appearances in the original set. We can sort the result set by, say, the grouping field, to get a desired order, but we could lose the original order of members after they are grouped, or at least have difficulty in recovering it.

One of the examples is to count the frequencies of each word in a textbook. This is a simple equi-grouping operation, and the order of members of the result set should be by default in line with the order of their first appearances. This has practical meaning. In a class, a teacher teaches the words according to the order. This order, however, will be difficult to get if it isn’t returned with the result set of the group operation. To do the task, we need to assign a sequence number to each word according to the order of appearance, perform the grouping while getting the smallest sequence number for each unique word, and then sort the grouping subsets by the smallest sequence numbers, which will be discarded after the operation is completed. The process is roundabout and inefficient.

It is clear that the order of members in the result set of an alignment group operation and an enumerated one should be based on the specified base set. And it’s natural that the order of members in a result set of an order-based group operation should be according to the order of their creation.

As the unordered set-based SQL doesn’t define the order of members in a result set for group operations, it can’t ensure that a returned result set has the same order as the original set. In practical applications, most databases use the HASH algorithm to implement a group operation. In this case, the order of members in a result set is the order of HASH values. Since the order of hash values hasn’t any practical uses, a re-sorting is required, if needed. We need to generate sequence numbers for members of the original data set to get the sorting condition and use it to perform the group operation. This is complicated and slow. Some databases perform group operations by first sorting the data set, generating a result set where members are ordered by the grouping field (an expression). This order is useful. But there is still much difficulty if we want to recover the order complying with that of the original data set.

SQL can achieve the result of alignment grouping and enumerated grouping using LEFT JOINs. But both the HASH algorithm and the sorting method will lose the order specified by the base set, which is essential for these two types of group operations. To recover the specified order, we need to attach a sequence number to each member of the base set, changing a simple single-value set into a set consisting of multi-field records. Moreover, it’s a hassle to maintain the sequence numbers if a member is added to or deleted from the base set, since the sequence numbers of all members following the newly-added/deleted member should be adjusted. In a word, it’s complicated to do alignment grouping and enumerated grouping in SQL.

In principle, the order of members in each grouping subset should by default be consistent with their order in the original set. Whether the order is useful or not depends on the subsequent operations.

Since SQL doesn’t care about the order for group operations, it automatically performs an aggregate operation – one where the order of members doesn’t have an impact on the result set, like SUM, COUNT and other common aggregate computations - after each group operation. For SQL, the order of members of a grouping subset is meaningless.

The order, however, plays a role in certain aggregate computations. One instance is to get the time difference between the last two logins of each user based on a login log (by default a log file is sorted by time). To do this, we need to group the file by user, get the last two records from each subgroup, and calculate the difference. Here the order of members in each subset is important. About the aggregate operations, we’ll discuss them blow by blow in subsequent articles.