Extension of JOINs: The Concept of Dimension

When it comes to data analysis, the term dimension is often used. Operations over data cubes, such as drill-down, roll-up and slicing, center on dimensions. Almost all data analysts know and use this term. But when asked about its definition, one hardly can give an answer.

Generally, we refer to attributes (fields) used to categorize things as dimensions, such as area, year, product type, etc., and attributes used in aggregate operations as measures, such as sales, output, scores, etc. Dimensions can’t be used for aggregate operations, such as calculating the total area makes no sense. Measures can’t be used to categorize things, such as grouping things by sales doesn’t have business significance. Usually we judge whether an attribute is a dimension by “whether it can be used for categorization”. But this is actually a description of the nature of dimensions and cannot be used as a definition.

Here we discuss the concept of dimension based on relational databases. First briefly review the basic concepts.

There is a relational database in which there are multiple data tables. The table’s data structure consists of one or more fields, and the table’s data is made up of records.

A field or a collection of fields that meet the following restriction can be designated as the primary key of a data table: all records in the table have unique values on the field(s). In other words, a primary key value should uniquely identify a record. There may be no primary key set for a data table; but there must be only one primary key, if any. The one field or multiple fields that constitute a primary key are called primary key field(s).

By contrast, a data table can have more than one foreign key, which is made up of one field or multiple fields of this table. The foreign key values always fall in the range of primary key values of an associative table (sometimes it is the table itself), which is called the foreign-key-referenced table, shortened as foreign key table. The one field or multiple fields that constitute a foreign key are called foreign key field(s).

Now, let’s define the concept of dimension. In a relational database, a field that is a primary key field but isn’t a foreign key field is a dimension. The table where a dimension is located is a dimension table. A dimension can be represented by a primary key field in a dimension table. A field that is neither a primary key field nor a foreign key field is a measure. Obviously, a field cannot be both a dimension and a measure.

One point needs to be clarified. The primary key and the foreign key we are talking about are logical concepts, which are the ones introduced in the E-R model. Sometimes for the sake of performance, they are not really set up in the physical data structure. That case is beyond our consideration.

We can better understand the concept of dimension through examples.

Here are two homo-dimension tables:

The employee table, with the following fields:

id

name

salary

The manager table, with the following fields:

id

allowance

By specifying manager.id field as a foreign key pointing to the id field, the employee table’s primary key, the manager.id field becomes both a primary key field and a foreign key field, which means it isn’t a dimension. The employee.id field is a primary key field but not a foreign key field, so it is a dimension.

Then let’s look at the parent and child tables:

The Orders table, with the following fields:

id

customer

date

The OrderDetail table, with the following fields:

id

no

product

price

The OrderDetail.id field is both a primary key field and a foreign key field pointing to the Orders table. It isn’t a dimension. The Orders.id field is a primary key field but not a foreign key field. It is a dimension. The OrderDetail.no field is also a dimension because it is a primary key field but not a foreign key field.

The OrderDetail.no dimension is special because generally it will not have a foreign key field to reference in another table. We call an unreferenced dimension like this the solo dimension. There’s no need to list a solo dimension when performing dimension alignment operation on the query interface.

Now let’s check whether our definition of dimension is in line with the conventional notion of dimension, and offer solutions to the seemingly disagreeing instances.

For commonly seen dimensions, like area and product, there are corresponding tables in databases. Those dimensions correspond to the table’s primary keys. This is consistent with the above definition. It’s impossible for attributes like sales and production to be primary keys of tables, so they are not dimensions.

But is the date (or year) a dimension? Yes, obviously. But there isn’t a table in databases that uses date as the primary key. It appears that this contradicts our definition of dimension.

In fact, every database that uses date data types should logically have a date table. The date type fields of other data tables can be the foreign keys pointing to this date table. However, since date-related information can be calculated from the date itself (year, month, etc.) without the need to store the attributes separately, we usually do not create this date table in physical databases. So as long as we logically restore this date table, the date becomes consistent with our definition of dimension. We call a table that exists in logical but not in physical the pseudo table. It can be considered as a single-field table without records, where the only field is the primary key. A pseudo table can bear the dimensions without corresponding physical data tables. Similarly, the year and the month dimensions can also be defined by a pseudo table.

Then let’s look at the age. Sometimes this attribute is used to categorize things (grouping people by age group); other times it is used in an aggregate operation (say, calculating the average age of employees in each department). It appears that it is both a dimension and a measure. This does not conform to our definition of dimension.

In fact, we use age groups instead of age values themselves when grouping things by age. Age values are real numbers that can’t be directly used for categorization. The age is a measure, but the age group calculated from age values is a dimension. Here we introduce the concept of dimension function, which uses values of a certain field as the parameter to return values of a related dimension. The dimension function converts a measure into a dimension. With this concept, we can maintain the strict definition of dimension without contradicting the usual notions.

Here dimensions are defined by using the foreign key as the frame of reference. And the foreign key defines the JOIN relationship between two tables. In this sense, dimensions are defined by JOINs!