Set-oriented Syntax – Using SQL as an Example

SQL, the most commonly used language for structured data processing, is much more convenient in describing the basic operations than high-level languages like Java, despite its awkwardness in detail treatment. That’s because SQL is a set-oriented language, but high-level languages are not. The following looks at features of a set-oriented language based on SQL syntax while using Java for comparison (other high-level languages are similar).

Set-oriented ability

Usually the structured data exists in batches (i.e. in the form of sets). To conveniently process the structured data sets, it’s essential for a programming language to support set-oriented data manipulation well.

The high-level languages like Java don’t provide ready-to-use class library for performing set-oriented operations. Though they supply array type (which is equivalent to the sets), they define only a very limited number of basic operations. Even a simple sum needs 4-5 lines of loop statement; for operations like filtering and grouping and aggregation, often hundreds of lines of code are needed. The tediously long code is an obstacle to grasping the overall structure of the algorithm because the critical steps fade into oblivion among the handling of innumerable details.

SQL, by contrast, offers a variety of set-oriented operations, including aggregate operations like SUM/COUNT, WHERE clause for filtering, GROUP statement for grouping, and basic operations like intersection, union and difference, making it possible to produce short code.

Expression parameter

Does the set-oriented ability alone is enough for structured data processing? Will the high-level languages achieve the same result as SQL does if we develop a set of class library for them to handle set-oriented operations?

Yet it needs more to become a SQL equivalent.

Take the filtering operation as an example. A filtering needs a criterion to retain members that satisfy the specified criterion – technically, members that make the the criterion true. In SQL, the criterion is an expression, like WHERE x>0, meaning getting members over which the result of x>0 is true. The expression is computed during the traversal of set members rather than before the execution of the SQL statement. In essence, the expression is a function - one that uses the current member as the parameter. The WHERE clause, for instance, uses a parameter which is a function defined in the form of an expression.

The Java syntax doesn’t support this simple phrasing. Though it allows passing a function to another one as a parameter, the phrasing is complicated and the code is bloated because a function needs to be specifically defined separately. If an expression is directly written into a function as the parameter, it will be computed before the traversal of each of the set members.

Obviously it’s convenient and intuitive for SQL to use an expression to define a function and pass it as a parameter.

There is a special term for this type of phrasing. It is Lambda syntax, or functional language.

SQL makes great use of the Lambda syntax. Apart from the operations like filtering where Lambda is a necessity, others become simple to perform with the syntax used. An aggregate function can use an expression to calculate the aggregate values while the operation specified by the function name is executed. The sum(x*x) function, for instance, gets the sum of squares, where the expression x*x is calculated during the execution of sum. With the absence of the Lambda syntax, we can first get a set consisting of the square of each of the members with the set-oriented processing, and then calculate the sums over the set. The non-Lambda phrasing is not as intuitive as the Lambda syntax because an expression computed over one member each time is easier to write and understand than one over a whole set,.

Direct reference of fields

The structured data constitutes not the single values but the records made up of field values.

In most cases, SQL allows referencing a field value in an expression parameter through the field name, rather than specifying the record containing the desired field value. A referenced field needs to be headed by its table name (or the table’s alias) only when there is a field (or there are fields) with the same name.

In Java, though we can implement the Lambda syntax by defining a function beforehand, only the current record can be passed to this function as a parameter and it needs to be specified before each field name in the function expression. For example, suppose the parameter representing the current record is x, then to calculate amount by multiplying unit price by the number, the expression will be like “x.unitprice*x.number”. In SQL, however, it is simply written as “unitprice*number”.

Behind SQL’s simple syntax is the sophisticated program design which enables the syntax to be correctly interpreted by the parser and properly executed. Some scripting languages don’t have the SQL-like phrasing feature though they support Lambda syntax which allows defining a function in the form of an expression and passing it as a parameter. Instead, they use the longwinded phrasing style x.unitprice*x.number”. A language that doesn’t support the direct reference of fields isn’t a professional structured data processing language.

Dynamic data structure

SQL also has a good support of dynamic data structure.

Often the returned results of structured data computations are also structured data. The structure of a returned result depends on the type of the operation, and can’t be decided before the code is written. That is why the ability to support dynamic data structure is needed.

Any SQL SELECT statement is able to produce a new data structure by adding or deleting fields, rather than by first defining a structure (or class). Without this ability, Java defines the structures (or classes) expected to appear while the code compiles; and, in principle, doesn’t support generating a data structure dynamically during the execution process.

The interpreted language

A compiled language doesn’t support dynamic data structures. Also, it isn’t suitable for implementing the Lambda syntax because the compiler can’t decide whether it should first calculate the parameter expression and then pass the value, or it should compile the expression as a function for parameter passing. More syntax notations are needed to tell the compiler what to do. An interpreted language allows a function to make the choice. The interpreted execution is another important feature of a set-oriented language.