Common Traversal Syntax

A traversal algorithm is essential for performing set-oriented operations, like the aggregate operations including SUM, COUNT, MAX and MIN, the conditional filtering, or the formation of a set based on members of another set. In most cases, the set-oriented syntax requires very short code (often one statement instead of a multiline code block) to describe a traversal operation. It’s necessary for us to review the various traversal scenarios to design a set of coherent syntax rules for implementing the operations.

In the following, we look at the traversal scenarios from simple to complex while discussing the SQL performance in handling them.

1. Operations over set members

Take the calculation of the sum of the members as an example, which is the simplest scenario.

We can handle it using the usual style of the functional syntax, which uses the to-be-traversed set as the parameter to get the return value, like sum(A), which calculates the sum of the members of set A. We can also adopt the object-style syntax to write the function as A.sum().

2. Referencing set members

In another scenario, we want to calculate the sum of squares. Then how to express the square?

Here the Lambda syntax previously mentioned when we talking about the set-oriented syntax rises to the occasion. Calculating squares is in essence a function, which uses the current members of a set being traversed as the parameter and returns its square. The Lambda syntax allows embedding the function in the traversal statement in the form of an expression. Thus one statement is sufficient in getting the task done. But there’s one more issue. What kind of identifier or sign should we use to represent the current member?

Obviously it isn’t wise to define a parameter name in advance, because that way the Lambda expression will become bloated and lose its intended conciseness. Yet some programming languages adopt the method. We don’t think it’s a good idea. It isn’t a good way to use a fixed identifier, either, since the identifier is inconvenient to use if it is long, and is very likely have the same name as some other local variable and leads to misunderstanding if it is short. Our suggestion is to use a special symbol to represent the referenced member.

For example, we can use “~” to represent the current member. Then the calculation of squares can be expressed as A.sum(~*~), which is simple and easy to understand, or A.(~*~).sum(), which calculates each member’s square to form a new set first and then the sum of the members of the new one. There are two steps, in which“~”is used to describe the square in the functional expression in the first step but the sign isn’t needed in the second step.

3. Referencing fields in structured data processing

But SQL, the traditionally set-oriented language, doesn’t define a special sign or an identifier to represent the current member during a traversal process. Then how does SQL handle the computing scenario in section 2?

SQL doesn’t have the general concept of set that allows members of any type. A SQL set is in fact a table, whose members are records of the same structure. Though the SQL system defines the concept of record, it can’t use it as a data type, like, for reference. To traverse a set with single-value members in SQL, we have to dress up each single value as a single-field record and perform a traversal over a table consisting of single-field records. All computations are performed over one or multiple fields, instead of a record.

But why this has anything to do with SQL’s lack of a special sign to represent the current member?

The set-oriented syntax for structured data processing requires a concise way to reference fields. SQL provides an easy-to-use mechanism for directly referencing fields. But as it is only able to compute over fields, it’s natural that the language doesn’t offer a method for referencing the current member (record). For example, to calculate squares in SQL is definitely an operation over a certain field, it is meaningless to perform the operation over a record (a member of a SQL set)

SQL achieves a simpler syntax by sacrificing a more powerful ability to phrase set-oriented operations. For a programming language supporting a set with generic type members, an identifier like “~” is necessary. Besides, it’s more convenient to handle structured data if the SQL ability of directly referencing fields can be fully supported. To calculate the total amount over a sales table, for instance, it’s simpler and more intuitive to write the expression as “UnitPrice*Number” than to write it as“~.UnitPrice*~.Number”. The SQL style is really worth learning.

4. The rules of nested references

In essence a traversal operation is a loop. Since a loop statement may have multiple layers, it is possible that a traversal needs nested references. To calculate the intersection of set A and set B, a simple algorithm is to traverse members of A to see if any of them appears in B (it’s also a traversal operation for set B). Thus there will be a double-layer traversal.

In this case, there will be ambiguity about the object represented by the “~”- whether it is the current member in set A or that of set B. There needs to be a definite syntax rule about this.

Generally the proximity principle is adopted. By default, “~”represents a member of the set in the inner layer, if no definite instruction is given; while the current member of the set in the outer layer needs to be explicitly indicated. The expression for calculating the intersection can be written as A.select(B.count(~==A.~)>0), where“~”represents the current member of set B by default but the current member of set A is A.~, for the purpose of differentiation.

SQL allows the direct reference of the fields during structured data processing, which may cause ambiguity between sets in different layers. In those cases, the proximity principle applies, too. If there are namesake fields in the inner and outer layer tables, a field is treated as one in the inner layer table by default. In referencing a field in the outer layer table which has the same name as one in the inner layer table, the table name needs to be attached. If there aren’t namesake fields in both inner and outer layer tables, table names are not necessary because ambiguity doesn’t exist.

As basic as it is, the traversal operations require some special attention in designing its syntax rules. Generally speaking, SQL does well. It is convenient in describing the common traversal operations, though the language lacks generic data types.