SPL Pseudo Table Data Type Optimization

 

A table storing data physically (which is simply called physical table) usually uses some storage mechanisms to boost performance and reduce space usage. Yet these mechanisms will increase coding complexity while pushing up performance.

Suppose we have detail table, customer table and product table. Below shows their structures and relationships:

..

Below shows data in each of the three tables:

..

The computing task is like this: get information according to the condition that the device the current exchange uses is smartphone or tablet, and that the current customer comes from California, Texas or Illinois, and that the current product is not Stewing Beef; then group records by state, the year/month in ddate, device and product type, and calculate total amount and number of transactions in each group. Values in the result set should be displayed in easy-to-read strings.

Let’s look at the storage mechanism each physical table uses and analyze how much impact they have on increasing the computing performance.

First, physical table detail does not store any customer- or product-related fields in order to avoid data redundancy and reduce disk usage. We need to join the three tables through foreign keys before performing subsequent computations. The table also does not store transaction year and month. We need to calculate the year and month using the expression according to the transaction date. A field calculated based on an existing field, like the year/month field, is called calculated field.

Second, detail table uses integers 1, 2 and 3 to represent different devices – computer, smartphone and tablet respectively– used in the current record in its enumerated dimension device because integers occupy small space and are fast to compute.

Third, detail table stores many bool dimensions, such as isOversea (whether the product is purchased overseas) and isCreditCard (whether the product is paid by credit card). If we want to store every such dimension as a separate field, a lot of integer type binary fields are needed even if we use binary dimensions (values are 1 or 0) instead of bool dimensions. But actually, the table only uses a short 16-bit integer type bools field to successfully store 16 bool dimensions. Each binary bit stores one bool dimension. We call such a field bit-based dimension. A bit-based dimension helps minimize storage space usage, as well as being able to calculate 16 bool values at once using binary AND/OR operation. Performance is thus considerably improved.

If we directly use the three physical tables to achieve the computing logic, we need to write a script to calculate year and month from ddate, convert enumerated dimension device, associate detail table’s cid field with customer table’s foreign key, and relate detail table’s pgid field and pid field with product table’s foreign key. The degree of complexity is high and amount of code is large. The logic could also involve bool dimensions, such as the requirement that detail transaction information should be overseas purchase and make IsCreditCard field another grouping field. Moreover, we have to implement conversion from bools field’s bit values to boolean dimensions. This further complexes the coding process.

An SPL pseudo table is a logical table defined based on a physical data table. It supports defining user-defined fields according to real fields, which enables to encapsulate various storage structures, including calculated field, enumerated dimension, bit-based dimension and pre-defined foreign key, into one place. This makes storage mechanisms used in a physical table transparent.

Using SPL pseudo tables helps store data conveniently so that we can have high computing performance and low disk utilization, and makes data convenient to manipulate by reducing coding complexity.

Let’s take a look at the working principles behind data type conversion through pseudo tables.

Calculated field

To group detail table by year/month (yyyymm) and calculate the total amount and total transactions in each group, the ddate field cannot be directly used. Instead, we need to calculate the transaction year/month (yyyymm) based on the ddate field. Such calculated fields are many, like year, year/month and week. All can be calculated according to a date field.

Though we can add several fields to a physical table to store the date data and these added fields can have little effect on computing performance when columnar storage is used, it takes time to generate them and they increase disk space usage. What’s more, some data cannot be pre-stored, such as customer ages, which need to be calculated using birth dates and the current date.

We can define a calculated field in a pseudo table. Values of a calculated field do not exist in the physical table but are calculated from a real field. We can define year/month and age as calculated fields in a pseudo table so as to avoid generating data for the physical table and save up disk space. A calculated field in a pseudo table can be manipulated as a regular field. That’s convenient.

For example, we create pseudo table p_detail based on the physical table detail, and define a user-defined field dmonth for generating a calculated field storing transaction year/month, as the following shows:

..

detail table does not have a dmonth filed. The field needs to be calculated through an expression exp, which is month@y(ddate) here.

SPL uses a multi-attribute structure to define a pseudo table. Each attribute has a fixed name. In this instance, pseudo table p_detail’s definition contains physical table name, calculated field name and expression exp, as shown below:

..

Attribute file is the physical table name based on which the pseudo table is generated. Its value is composite table "detail.ctx". Attribute column commands a set of user-defined fields. Each row of its value defines a user-defined field. In such a row, name column has calculated field name dmonth and exp column has expression month@y(ddate).

A pseudo table can also be generated from another type of external table (a bin file, a multi-zone composite table file or a cluster composite table file) or table stored in the memory (a table sequence, an in-memory table or a cluster in-memory table).

Below is the SPL code for defining pseudo table p_detail:


A

1

=[{file:"detail.ctx",

column:[

{name:"dmonth",exp:"month@y(ddate)"}

]

}]

2

>p_detail=pseudo(A1)

A1 generates a pseudo table definition. A2 generates pseudo table p_detail.

After a pseudo table is generated, we use a cursor (p_detail.cursor().fetch(6)) to retrieve some data from it, as shown below:

..

Besides retaining the real fields in the physical table, the pseudo table also adds a calculated field dmonth. Programmers can regard p_detail as a simple data table and directly use demonth field to perform grouping and aggregation:


3

=p_detail.groups(dmonth;sum(amt):sum,count(~):count)

A3 automatically transforms the pseudo table field dmonth to real field ddate’s expression according to the above pseudo table definition. Below is part of the result set returned after code is executed:

..

Enumerated dimension

The enumerated dimension field device in physical table detail stores numeric values 1,2 and 3. Numeric values are fast to compute but they are not easy to understand and use as strings are. If we store both strings and numeric values, we can harvest good performance and ease of use. But problems are still there – increase of both data generation time and disk space usage. On certain occasions, the inconsistency between strings and numeric values may occur.

Similarly, we can define a pseudo field in the pseudo table to address this issue. Only store values in the physical table, and define the relationship between numeric values of the enumerated dimension and the strings through a pseudo field in the pseudo table. This way we can directly use strings when performing computations on the pseudo table and get strings in the computing result. The physical table is only responsible for storing numeric values to ensure that computing performance is high.

In the following screenshot, pseudo table p_detail defines a new pseudo field deviceString to make values of the enumerated dimension transparent:

..

Physical table detail does not have a deviceString field. We need to generate one according to enumerated set enum. In our example, enum values is ["computer","smart phone","pad"], where the three enumerated members correspond to natural number 1, 2 and 3.

Now the pseudo table definition becomes this:

..

Another pseudo field definition (deviceString) is added under column. The definition includes the above-mentioned enum.

Here is the updated SPL code for defining pseudo table p_detail:


A

1

=[{file:"detail.ctx",

column:[

{name:"dmonth",exp:"month@y(ddate)"},

{name:"device",pseudo:"deviceString",enum:["computer","smart phone","pad"]}

]

}]

2

>p_detail=pseudo(A1)

A1 adds an enumerated type pseudo field deviceString to the pseudo table definition.

Below is part of the data retrieved from the pseudo table:

..

In the computation, p_detail is still treated as a simple, regular table and deviceString is directly used for performing conditional filtering and grouping, and strings are displayed in the result set:


3

=p_detail.select(deviceString=="pad" || deviceString=="smart phone")

4

=A3.groups(dmonth,deviceString;sum(amt),count(~))

In A3, strings smart phone and pad in the filtering condition are frist automatically converted into 2 and 3 respectively, and then they are used to filter data in the physical table detail.

In A4’s result set, device field values 2 and 3 are transformed back to "smart phone" and "pad" respectively, as shown below:

..

Bit-based dimension

The bools field in detail table is a bit-based dimension that stores two binary dimensions isOversea (whether it is an overseas purchase) and isCreditCard (whether the purchase is paid by credit card). Each dimension occupies one binary bit of a bools field value. In this way, a 16-bit short integer type field is able to store 16 boolean dimensions. As we said, storing data in a bit-based dimension can not only decrease storage space usage but accomplish computation of as many as 16 binary values at one time using binary AND/OR operation. A bit-based dimension, however, is not as easy to read and use as an independent boolean dimension or binary dimension.

Now we can define a pseudo field in the pseudo table and make conversion from bools field values to the boolean dimension transparent, as shown below:

..

And pseudo table definition becomes this:

..

A user-defined field definition is added under column. In the definition, name value is real field name bools in the physical table, and attribute bits contains a set of pseudo field names – there can be 16 binary field names at most. Here bits value includes two field names, which correspond to the first binary bit and the second binary bit in a bools field value.

The SPL code for defining pseudo table p_detail is changed accordingly:


A

B

1

=[{file:"detail.ctx",

column:[

{name:"dmonth",exp:"month@y(ddate)"},

{name:"device",pseudo:"deviceString",enum:["computer","smart phone","pad"]},

{name:"bools",bits:["isOversea","isCreditCard"]}

]

}]

2

>p_detail=pseudo(A1)

A1 adds a user-defined field bools. Attribute bits defines two pseudo field names.

Now part of data retrieved from pseudo table p_detail is as follows:

..

In subsequent computation, we directly use isOversea and isCreditCard to perform conditional filtering and grouping operation, and display Boolean values in the result set:


3

=p_detail.select(isOversea && (deviceString=="pad" || deviceString=="smart phone"))

4

=A3.groups(dmonth,deviceString,isCreditCard;sum(amt),count(~))

In A3, isOversea in the filtering condition will be automatically converted into the first bit of the bools value to filter physical table detail.

In A4’s result set, value of the second bit under bools field will be transformed to boolean value isCreditCard:

..

Pre-defined foreign key

Often a physical table needs to associate with another table through certain one or more certain code fields. For instance, detail table associates with customer table’s primary key id through its customer id (cid), and associates with product table through its pgid field and pid field. The target of associations is to make fields in three tables engage in a same computation. For instance, we want to add filtering conditions in the filtering computation at the beginning of the essay – the state customers are based is California, Texas or Illinois, and product is not Stewing Beef, and add two more grouping fields - state (state customers are based) and type (type of product).

To do this, we define a user-defined field in the pseudo table to make relationships between detail table, customer table and product table obvious, as shown below:

.. In the above screenshot, cid is defined as a user-defined field, its attribute dim defines a foreign key relationship between physical tables detail and customer. Here dim value is customer table whose primary key is id.

Since detail table and product table are associated through two fields, we add a user-defined field product to represent their relationship. This field does not exist in the physical tables. We call it referencing field.

So, the pseudo table definition is again changed:

..

Two more user-defined definitions are added under column – they are cid and product. As there isn’t a referencing field product in physical tables, we need a fkey attribute to define the physical table detail’s foreign key fields.

SPL code for defining pseudo table p_detail is changed, too:


A

B

1

=T("customer.btx").keys(id)

=T("product.btx").keys(gid,id)

2

=[{file:"detail.ctx",

column:[

{name:"dmonth",exp:"month@y(ddate)"},

{name:"device",pseudo:"deviceString",enum:["computer","smart phone","pad"]},

{name:"bools",bits:["isOversea","isCreditCard"]},

{name:"cid",dim:A1},

{name:"product",dim:B1,fkey:["pgid","pid"]}

]

}]

3

>p_detail=pseudo(A2)

A2 defines user-defined field cid and specifies that the corresponding dim is A1; adds a referencing field product whose dim is B1 and defines pgid and pid as physical table detail’s two foreign key fields.

Now pseudo table p_detail has the following data:

..

As the pseudo table’s cid field is already associated with customer table, we can get to the latter’s field directly through cid, like cid.name and cid.state, and reference a field in product table directly via product field:


4

=p_detail.select(product.name!="Stewing Beef" && ["California","Illinois","Texas"].pos(cid.state) &&(deviceString=="pad" || deviceString=="smart phone"))

5

=A4.groups(cid.state,dmonth,deviceString,product.type;sum(amt),count(~))

In A4, The filtering condition contains user-defined fields cid and product. SPL will perform the computation on the pre-associated physical tables.

Below is part of the data in A5’s result set:

..

Review & Summary

An SPL pseudo table amounts to meta data, which makes the complex storage mechanisms used in physical tables transparent. We use a pseudo table like a simple regular table, reducing complexity of computations. In addition, pseudo tables are able to automatically map logically simple computations as high-efficiency computations run on the high-performance storage mechanism. This ensures the best computing performance possible.

After defining calculated field, enumerated field, bit-based dimension and pre-defined foreign key in a pseudo table, users do not need to take care of storage mechanisms and computing methods anymore and just treat it as a simple, common table. Besides, pseudo tables can be used in user analysis scenarios to achieve data being ordered by two dimensions at the same time (find details in SPL Pseudo Table Bi-dimension Ordering Structure).