SPL practice: query massive and flexible structured data

 

Problem description

A flexible data structure means that each record of a data table has different data structure. Usually, the fields of data table are divided into two parts, one part is the fields that are common to all records (common field for short), and the other part is the fields that vary from record to record (flexible field for short). There may be as many as hundreds of flexible fields, yet each record may have only a few of them. In complex situations, these flexible fields are further classified.

Here is a student score table, and this table does not involve classification:

Common field Flexible field
Student ID Name Gender Age Chinese English Math Physics Chemistry Biology
1 Student A Male 16 67 50 90 100 80 80
2 Student B Female 18 / / / 99 / 70
3 Student C Male 17 98 88 78 / / /










Some students do not take all subjects as elective course, so they do not have score for the unelected subject, represented by a slash.

The following is a sports student score table, and this table involves classification:

Common field Flexible field
Student ID Name Gender Age Basketball Football
Comments National player or not Score Comments National player or not Score
1 Student A Male 16 The student often... Yes 100 / / /
2 Student B Female 18 / / / The student occasionally... / 95
3 Student C Male 17 The student is good at... Yes 20 The student is skillful at... Yes 10

The common fields contain the basic attribute of each student. The flexible fields are classified into different attributes (basketball and football in this example), and each attribute has the same classes (comments [string], National player or not [boolean] and score [value] in this example).

Now we will focus on the complex situation involving classification. The situation without classification is a special case of complex situation.

Let’s take the following query criteria as an example:

Find out the records whose gender is “male” in the common field (primary table)

and,

find out the records whose class value “national player or not” under the attribute “basketball” is “yes” and whose class value “score” under the attribute “football” is less than 60 in the flexible field (sub table).

In essence, the data structure described in the two examples can be logically regarded as a big wide table. However, when there are too many attributes and classes, this table will have thousands of fields (common fields + all possible attributes * classes). If this table is processed into a physical wide table, it will result in too many fields, most of which are blank.

In general, SQL describes such data structure through the following two tables:

Main table (common field)

Field name Type Remarks
id Value id number (auto-incrementing)
name String Name
gender Boolean Gender
age Value Age

Sub table (flexible field)

Field name Type Remarks
id Value id number (same as main table)
atr Value Attribute (enumerable, stored as number)
comment String Comments (manually enter)
nt Boolean National player or not
rate Value Score

In this table, the classes refer to the fields comment, nt, and rate. To query the records according to the above-mentioned query criteria, coding in SQL will be:

SELECT *
FROM main m, subs
WHERE m.id= s.id AND
    (SELECT count(*)
    FROM sub ss
WHERE m.id= ss.id AND (ss.atr= 1 AND ss.nt= 1) or (ss.atr=2 AND ss.rate<60) ) = 2

If this query task is handled in SPL, there are usually three methods:

I. Use json-style field

For such flexible data structure, json-style field is usually used.

When no classification is involved, the json string is a record. For example, the scores of student B in the student score table:

{
“Physics”: 99,
“Biology”: 70
}
When involving classification, it is a table sequence. For example, the scores of student C in the sports student score table:
[
{
“Attribute”: basketball,
“Comments”: the student is good at...
“National player or not”: yes
“Score”: 20
},
{
“Attribute”: football, 
“Comments”: the student is skillful at...
“National player or not”: yes
“Score”: 10
}
]

Using the json-style field will increase the storage amount (as each record or table sequence needs to store the field names once), and it always needs to parse json as the record or table sequence before calculation, which will result in poor performance. It is very convenient for SPL to process json-style field, and the corresponding code is relatively conventional, so we won’t go into details here.

Now let’s look at two other methods.

II. Use sequence type field

Let’s take the aforementioned “sports student score table” as an example, using the sequence type field is to regard all attributes of the sub table as a sequence field, and the class is then converted to the corresponding sequence. See the table below for details:

Common field Flexible field
Student ID Name Gender Age Attribute Comments National player or not Score
1 Student A Male 16 [Basketball] [The student often...] [Yes] [100]
2 Student B Female 18 [Football] [The student occasionally...] [null] [95]
3 Student C Male 17 [Basketball, football] [The student is good at..., and skillful at...] [Yes, Yes] [20,10]

2.1 Data structure

Define the test data structure as follows:

Main table (common field)

ield name Type Remarks
id Value Student ID (auto-incrementing)
name String Name
gender Boolean Gender
age Value Age
class Value Class (enumerable, stored as number)

Sub table (flexible fields)

Field name Type Remarks
id Value Student ID (same as main table)
atr Value Attribute (enumerable, stored as number)
comment String Comments (manually enter)
nt Boolean National player or not
rate Value Score

2.2 Move the data from database and store

The code for taking the data from database:

A
1 =connect@l(“oracle12c”)
2 =A1.cursor(“SELECT * FROM MAIN ORDER BY ID”)
3 =A1.cursor@x(“SELECT * FROM SUB ORDER BY ID”)
4 =A3.group(id;~.(atr):atr,~.(comment): comment,~.(nt): nt,~.(rate): rate)
5 =joinx(A2:main,id;A4:sub,id)
6 =A5.new(main.id,main.name,main.gender,main.age,main.class,sub.atr,sub.comment,sub.nt,sub.rate)
7 =file(“seq_all.ctx”).create(#id,name,gender,age,class,atr,comment,nt,rate)
8 >A7.append@i(A6)

2.3 Generate test data

For the convenience of testing, we write the following SPL code to directly generate a composite table file seq_all.ctx:

A B
1 >rand@s(1)
2 >n=100000
3 =file(“seq_all.ctx”).create(#id,name,gender,age,class,atr,comment,nt,rate)
4 for 30 =to(n*(A4-1)+1,n*A4).new(~:id,“name”/~:name,rand(2)+1:gender,rand(17)+18:age,rand(10)+1:class,20.sort(rand()).to(rand(5)+5):atr,atr.len().(rands(“qwertyuiopasdfghjklzxcvbnm”,rand(200)+2)):comment,atr.len().([true,false,null](rand(3)+1)):nt,atr.len().(rand(100)+1):rate)
5 >A3.append@i(B4.cursor())

2.4 Query

Case 1: the age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6

Case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60

A
1 /case 1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6
2 [1,3,6]
3 age>=20 && age<=25 && A2.pos(class)
4 /case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60
5 j(atr,nt,rate).count((#1==1 && #2==true) || (#1==2 && #3<60))==2
6 /Main table fields
7 id,name,gender,age,class
8 /Sub table fields
9 atr,comment,nt,rate
10 /case1_query
11 =now()
12 =file(“seq_all.ctx”).open()
13 =A12.cursor@v(${A7},${A9};${A3})
14 =A13.fetch(5000)
15 =interval@ms(A11,now())
16 /case2_query
17 =now()
18 =file(“seq_all.ctx”).open()
19 =A18.cursor@v(${A7},${A9};${A5})
20 =A19.fetch(5000)
21 =interval@ms(A17,now())

A5 is a conditional expression where j is to join multiple sequence fields into a table sequence, and then count the number of each record that satisfies the condition. If the number is equal to the one in the condition, it means that this record needs to be taken.

III. Use attached table

The data structure in 2.1 can be regarded as the primary-sub table relationship. When two tables are in primary-sub relationship, we can regard the common field part as a base table, and the flexible field part as an attached table. For more information about attached table, visit: SPL Attached Table

3.1 Move the data from database and store

A
1 =connect@l(“oracle12c”)
2 =A1.cursor(“SELECT * FROM MAIN ORDER BY ID”)
3 =A1.cursor@x(“SELECT * FROM SUB ORDER BY ID”)
4 =file(“attach_all.ctx”).create(#id,name,gender,age,class)
5 =A4.attach(sub,atr,comment,nt,rate)
6 =A4.append@i(A2)
7 =A5.append@i(A3)

3.2 Generate test data

The script for generating the attached table data based on the file generated in 2.3:

A
1 =file(“seq_all.ctx”).open().cursor(id,name,gender,age,class)
2 =file(“seq_all.ctx”).open().cursor(id,atr,comment,nt,rate).news(#2;id,~:atr,comment(#):comment,nt(#):nt,rate(#):rate)
3 =file(“attach_all.ctx”).create(#id,name,gender,age,class)
4 =A3.attach(sub,atr,comment,nt,rate)
5 =A3.append@i(A1)
6 =A4.append@i(A2)

3.3 Query

The query criteria are the same as that described in 2.4.

A
1 /case1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6
2 [1,3,6]
3 age>=20 && age<=25 && A2.pos(class)
4 /case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60
5 sub.count((atr==1 && nt==true) || (atr==2 && rate<60))==2
6 /Main table fields
7 id,name,gender,age,class
8 /Sub table fields
9 atr,comment,nt,rate
10 /case1_query
11 =now()
12 =file(“attach_all.ctx”).open()
13 =A12.cursor@v(${A7},sub{${A9}}:sub;${A3})
14 =A13.fetch(5000)
15 =interval@ms(A11,now())
16 /case2_query
17 =now()
18 =file(“attach_all.ctx”).open()
19 =A18.cursor@v(${A7},sub{${A9}}:sub)
20 =A19.select@v(${A5})
21 =A20.fetch(5000)
22 =interval@ms(A17,now())

3.4 Search twice

When using the sequence type field, we can utilize the pre-cursor filtering technology to improve performance. However, the conditions on the sub table in attached table require reading all columns first, though some fields are unwanted. If few results are obtained by conditional filtering, it means many useless fields will be read.

Therefore, when using attached table, it is likely to be faster to first take the primary key by the condition-related fields before searching. In contrast, for the sequence type field, it is unnecessary to process this way.

First, find out the primary key sequence of the records that meet conditions, and then search for the results through this sequence. SPL script:

A
1 /case1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6
2 [1,3,6]
3 age>=20 && age<=25 && A2.pos(class)
4 /case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60
5 sub.count((atr==1 && nt==true) || (atr==2 && rate<60))==2
6 /Main table fields
7 id,name,gender,age,class
8 /Sub table fields
9 atr,comment,nt,rate
10 /case1_query
11 id
12 atr,nt,rate
13 /keys
14 =now()
15 =file(“attach_all.ctx”).open()
16 =A15.cursor@v(${A11},sub{${A12}}:sub;${A3})
17 =A16.fetch(5000)
18 =interval@ms(A14,now())
19 /find by keys
20 =now()
21 =file(“attach_all.ctx”).open()
22 =A21.cursor@v(${A7},sub{${A9}}:sub;A17.find(id))
23 =A22.fetch(5000)
24 =interval@ms(A20,now())
25 /case2_query
26 =now()
27 =file(“attach_all.ctx”).open()
28 =A27.cursor@v(${A11},sub{${A12}}:sub)
29 =A28.select@v(${A5})
30 =A29.fetch(5000)
31 =interval@ms(A26,now())
32 /find by keys
33 =now()
34 =file(“attach_all.ctx”).open()
35 =A34.cursor@v(${A7},sub{${A9}}:sub;A30.find(id))
36 =A35.fetch(5000)
37 =interval@ms(A33,now())

IV. Use row-based storage

SPL’s columnar storage adopts the data block and compression algorithm, leading to less access data volume and better performance as for the traversing operation. But for scenarios of index-oriented random data fetching, the complexity is much bigger due to the extra decompression process and the fact that each fetching is performed on the whole block. Therefore, the performance of columnar storage would be worse than that of row-based storage in principle.

For the same data, it is impossible to achieve optimal performance in both traversal operation and random data fetching. To obtain ultimate performance, we can store a redundant row-based storage composite table file, and establish an index for the primary keys of this composite table.

The script for converting to row-based storage and creating an index is as follows:

A
1 /col2row
2 =file(“seq_all.ctx”)
3 =file(“seq_all_row.ctx”)
4 =A2.reset@r(A3)
5 /create_index
6 =A3.open().index(idx;id)

First, find out the primary key in the way of attached table, and then search for the results in the row-based storage index through the primary key. The script is as follows:

A
1 /case1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6
2 [1,3,6]
3 age>=20 && age<=25 && A2.pos(class)
4 /case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60
5 sub.count((atr==1 && nt==true) || (atr==2 && rate<60))==2
6 /Main table fields
7 id,name,gender,age,class
8 /Sub table fields
9 atr,comment,nt,rate
10 /case1_query
11 id
12 atr,nt,rate
13 /keys
14 =now()
15 =file(“attach_all.ctx”).open()
16 =A15.cursor@v(${A11},sub{${A12}}:sub;${A3})
17 =A16.fetch(5000).(#1)
18 =interval@ms(A14,now())
19 /find by keys
20 =now()
21 =file(“seq_all_row.ctx”).open()
22 =A21.icursor(${A7},${A9};A17.contain(id))
23 =A22.fetch(5000)
24 =interval@ms(A20,now())
25 /case2_query
26 =now()
27 =file(“attach_all.ctx”).open()
28 =A27.cursor@v(${A11},sub{${A12}}:sub)
29 =A28.select@v(${A5})
30 =A29.fetch(5000).(#1)
31 =interval@ms(A26,now())
32 /find by keys
33 =now()
34 =file(“seq_all_row.ctx”).open()
35 =A34.icursor(${A7},${A9};A30.contain(id))
36 =A35.fetch(5000)
37 =interval@ms(A33,now())