Extension of JOINs: Dimension-oriented Query Syntax

With the definition ofdimensions, we can sort out the simplified JOIN syntax we talked about earlier.

First let’s define a field’s dimension:

1)A dimension field’s dimension isitself;

2)A foreign key field’s dimension is the dimension of the associative field in its foreign key table;

3)A measure fielddoesn’t have a dimension.

It is a recursive definition.

Then we redefine homo-dimension tables and parent-and-child tables using dimensions:

Homo-dimension tables: Table A and table B are homo-dimension tables if the set of the dimension(s) of table A’s primary key field(s) is equivalent to that of the dimension(s) of table B’s primary key field(s);

Parent-and-child tables: For table A and table B, if the set of the dimension(s) of table A’s primary key field(s) is the proper subset of that of the dimension(s) of table B’s primary key field(s), then table A is table B’s parent table and table B is table A’s child table.

According to the two definitions, it’s easy to get the following conclusions:

A table and the homo-dimension table of its homo-dimension table are homo-dimension tables; the relationships between them are equal;

The homo-dimension table of a table’s parent/child table is also its parent/child table; the child table of a table’s child table is still its child table.

Finally, we havegeneral fields:

1) The field(s) of a table is/are its general fields;

2)For table A, if one of its general fields is an ordinary field in table B, then the fields of table B’s homo-dimension table are also table A’s general fields;

3)If a general field is a foreign key field, the fields of the foreign key table to which it points are general fields of the table where it stays;

4)The dimension function of a general field is still a general field.

This is still a recursive definition.

We’ll use the examples cited in the previous articles to help you better understand these concepts:

SELECT * FROM employee WHERE nationality=‘US’ ANDdepartment.manager.nationality=‘Chinese’

SELECT id,name,salary+allowance FROM employee

department.manager.nationality,salary and_allowance_ are all the_employee_ table’s general fields.

With the definition of general fields, it is natural that JOINs can be eliminated as we discussed earlier. With SQL syntax, a general field can be taken as an ordinary field to be referenced, which enables foreign key attributization and interconnection of homo-dimension tables, and a parent table’s set type general field in its child table can be directly referenced in an aggregation, which realizes the integration of the parent and child tables. This is easy to understand through the examples in previous articles.

Dimension is at the core of this reformed syntax. We name it Dimension Query Language (DQL) in order todifferentiate from SQL.

It’s time to mend the hole in the dimension alignment syntax. This is the example we used in that article:

SELECT Contract.SUM(price), Payment.SUM(amount), Invoice.SUM(amount)

FROM Contract GROUP BY date FULL JOIN Payment GROUP BY date FULL JOIN Invoice GROUP BY date

This query summarizes contract amount, payment amount and invoice amount by date. But it selects only the aggregate fields (expressions) without a key date field. This will produce a hard to understand result set.

All of the three tables to be joined have a date field. It’s inappropriate to place any of them in the SELECT partbecause the date information it contains could be incomplete. Besides, the relationships between three tables are equal. There’s no reason to make any of them special. SQL handles the case by using the expression_coalesce(Contract.date,Payment.date,Invoice.date)_  in the query (which is the Oracle syntax). This makes the query complicated.

By abstractingdimensions from the database structure, it’s easy to solve the issue. As thethree date fields have the same dimension, we name the dimension DATE. Then the above query can be written like this:

SELECT Contract.SUM(price), Payment.SUM(amount), Invoice.SUM(amount) ON DATE

FROM Contract GROUP BY date FULL JOIN Payment GROUP BY date FULL JOIN Invoice GROUP BY date

An ON clause is added to specify the dimension to which the three date fields are aligned. Intelligently, dimension values will be selected to the result set and null values will be handled.

Similarly, the example with aggregations by area should be handled like this:

SELECT Sales.COUNT(1), Contract.SUM(price) ON AREA

FROM Sales GROUP BY area FULL JOIN Contract GROUP BY customer.area

A field aligning to a dimension can be a general field.

With the concept of dimensionfunction, we can further simplify some queries.

Still the above triple-table dimension alignment example. If we want to perform the aggregations by month instead of the general date, then the query can be written as:

SELECT Contract.SUM(price), Payment.SUM(amount), Invoice.SUM(amount) ON MONTH

FROM Contract GROUP BY month(date) FULL JOIN Payment GROUP BY month(date) FULL JOIN Invoice GROUP BY month(date)

The_month()_function is a dimension functionthat uses date values as the parameter to return the month dimension values.

The dimension functions are defined when the data structure is designed and thus are ready to use. When the alignment dimension is known, a suitabledimensionfunction will be automatically located according to the fields being aligned. So thequery is clear even if the dimension function isn’t explicitly written:

SELECT Contract.SUM(price), Payment.SUM(amount), Invoice.SUM(amount) ON MONTH

FROM Contract GROUP BY date FULL JOIN Payment GROUP BY date FULL JOIN Invoice GROUP BY date

To change to another level within a dimension, we just need tochange the ON clause while leaving the GROUP BY part what it is.

As we know, pre-aggregation is a common practice to increase the performance of multidimensional analysis.Aggregate values over measures are pre-calculated according to the possible dimension combinations used in analysis and stored for direct references when needed, instead of performing a calculation from scratch by traversal. But it is unfeasible to perform pre-aggregations based on all dimension combinations (because that will consume too much storage space). Usually only the most commonly used dimension combinations areselected for pre-aggregation.

This raises two issues. 1. How do the pre-aggregation data and the base data correspond?  2. How do we know which dimension combinations are the most commonly-used ones?

It isn’t difficult to handle the two issues under the SQL system if thebase data is a single table without joins.The pe-aggregated data corresponds to different dimension combinations in the table. From the historical analyses, we can find the most commonly used dimension combinations to help facilitate the data pre-aggregation process.

If the base data involves join operations, we need to create JOINs in the multidimensional analytical SQL queries. This makes the situation much more complicated. The relationalalgebra-based SQL JOINs have trouble in describing the correspondence between the pre-aggregated data and the base data. And it is hard to extract the dimension combinations according to which the current query is performed from the SQL JOIN statement.

DQL makes the whole thing much easier. It extends the concept of fields tointroduce the general fields, creating a logical single tablethat solves the issue of correspondence between the pre-aggregated data and the base data. Logically the base data is a single table. A DQL query clearly shows the dimension combinations (general fields) according to which an aggregation is performed, making it easy to collect the most commonly dimension combinations.