SPL practice: query massive and flexible structured data
Problem description
A flexible data structure means that each record of a data table has different data structure. Usually, the fields of data table are divided into two parts, one part is the fields that are common to all records (common field for short), and the other part is the fields that vary from record to record (flexible field for short). There may be as many as hundreds of flexible fields, yet each record may have only a few of them. In complex situations, these flexible fields are further classified.
Here is a student score table, and this table does not involve classification:
Common field | Flexible field | ||||||||
Student ID | Name | Gender | Age | Chinese | English | Math | Physics | Chemistry | Biology |
1 | Student A | Male | 16 | 67 | 50 | 90 | 100 | 80 | 80 |
2 | Student B | Female | 18 | / | / | / | 99 | / | 70 |
3 | Student C | Male | 17 | 98 | 88 | 78 | / | / | / |
… | … | … | … | … | … | … | … | … | … |
Some students do not take all subjects as elective course, so they do not have score for the unelected subject, represented by a slash.
The following is a sports student score table, and this table involves classification:
Common field | Flexible field | ||||||||||
Student ID | Name | Gender | Age | Basketball | Football | ||||||
Comments | National player or not | Score | Comments | National player or not | Score | ||||||
1 | Student A | Male | 16 | The student often... | Yes | 100 | / | / | / | ||
2 | Student B | Female | 18 | / | / | / | The student occasionally... | / | 95 | ||
3 | Student C | Male | 17 | The student is good at... | Yes | 20 | The student is skillful at... | Yes | 10 | ||
… | … | … | … | … | … | … | … | … | … |
The common fields contain the basic attribute of each student. The flexible fields are classified into different attributes (basketball and football in this example), and each attribute has the same classes (comments [string], National player or not [boolean] and score [value] in this example).
Now we will focus on the complex situation involving classification. The situation without classification is a special case of complex situation.
Let’s take the following query criteria as an example:
Find out the records whose gender is “male” in the common field (primary table)
and,
find out the records whose class value “national player or not” under the attribute “basketball” is “yes” and whose class value “score” under the attribute “football” is less than 60 in the flexible field (sub table).
In essence, the data structure described in the two examples can be logically regarded as a big wide table. However, when there are too many attributes and classes, this table will have thousands of fields (common fields + all possible attributes * classes). If this table is processed into a physical wide table, it will result in too many fields, most of which are blank.
In general, SQL describes such data structure through the following two tables:
Main table (common field)
Field name | Type | Remarks |
---|---|---|
id | Value | id number (auto-incrementing) |
name | String | Name |
gender | Boolean | Gender |
age | Value | Age |
Sub table (flexible field)
Field name | Type | Remarks |
---|---|---|
id | Value | id number (same as main table) |
atr | Value | Attribute (enumerable, stored as number) |
comment | String | Comments (manually enter) |
nt | Boolean | National player or not |
rate | Value | Score |
In this table, the classes refer to the fields comment, nt, and rate. To query the records according to the above-mentioned query criteria, coding in SQL will be:
SELECT *
FROM main m, subs
WHERE m.id= s.id AND
(SELECT count(*)
FROM sub ss
WHERE m.id= ss.id AND (ss.atr= 1 AND ss.nt= 1) or (ss.atr=2 AND ss.rate<60) ) = 2
If this query task is handled in SPL, there are usually three methods:
I. Use json-style field
For such flexible data structure, json-style field is usually used.
When no classification is involved, the json string is a record. For example, the scores of student B in the student score table:
{
“Physics”: 99,
“Biology”: 70
}
When involving classification, it is a table sequence. For example, the scores of student C in the sports student score table:
[
{
“Attribute”: basketball,
“Comments”: the student is good at...
“National player or not”: yes
“Score”: 20
},
{
“Attribute”: football,
“Comments”: the student is skillful at...
“National player or not”: yes
“Score”: 10
}
]
Using the json-style field will increase the storage amount (as each record or table sequence needs to store the field names once), and it always needs to parse json as the record or table sequence before calculation, which will result in poor performance. It is very convenient for SPL to process json-style field, and the corresponding code is relatively conventional, so we won’t go into details here.
Now let’s look at two other methods.
II. Use sequence type field
Let’s take the aforementioned “sports student score table” as an example, using the sequence type field is to regard all attributes of the sub table as a sequence field, and the class is then converted to the corresponding sequence. See the table below for details:
Common field | Flexible field | |||||||
Student ID | Name | Gender | Age | Attribute | Comments | National player or not | Score | |
1 | Student A | Male | 16 | [Basketball] | [The student often...] | [Yes] | [100] | |
2 | Student B | Female | 18 | [Football] | [The student occasionally...] | [null] | [95] | |
3 | Student C | Male | 17 | [Basketball, football] | [The student is good at..., and skillful at...] | [Yes, Yes] | [20,10] | |
… | … | … | … | … | … | … | … |
2.1 Data structure
Define the test data structure as follows:
Main table (common field)
ield name | Type | Remarks |
---|---|---|
id | Value | Student ID (auto-incrementing) |
name | String | Name |
gender | Boolean | Gender |
age | Value | Age |
class | Value | Class (enumerable, stored as number) |
Sub table (flexible fields)
Field name | Type | Remarks |
---|---|---|
id | Value | Student ID (same as main table) |
atr | Value | Attribute (enumerable, stored as number) |
comment | String | Comments (manually enter) |
nt | Boolean | National player or not |
rate | Value | Score |
2.2 Move the data from database and store
The code for taking the data from database:
A | |
---|---|
1 | =connect@l(“oracle12c”) |
2 | =A1.cursor(“SELECT * FROM MAIN ORDER BY ID”) |
3 | =A1.cursor@x(“SELECT * FROM SUB ORDER BY ID”) |
4 | =A3.group(id;~.(atr):atr,~.(comment): comment,~.(nt): nt,~.(rate): rate) |
5 | =joinx(A2:main,id;A4:sub,id) |
6 | =A5.new(main.id,main.name,main.gender,main.age,main.class,sub.atr,sub.comment,sub.nt,sub.rate) |
7 | =file(“seq_all.ctx”).create(#id,name,gender,age,class,atr,comment,nt,rate) |
8 | >A7.append@i(A6) |
2.3 Generate test data
For the convenience of testing, we write the following SPL code to directly generate a composite table file seq_all.ctx:
A | B | |
---|---|---|
1 | >rand@s(1) | |
2 | >n=100000 | |
3 | =file(“seq_all.ctx”).create(#id,name,gender,age,class,atr,comment,nt,rate) | |
4 | for 30 | =to(n*(A4-1)+1,n*A4).new(~:id,“name”/~:name,rand(2)+1:gender,rand(17)+18:age,rand(10)+1:class,20.sort(rand()).to(rand(5)+5):atr,atr.len().(rands(“qwertyuiopasdfghjklzxcvbnm”,rand(200)+2)):comment,atr.len().([true,false,null](rand(3)+1)):nt,atr.len().(rand(100)+1):rate) |
5 | >A3.append@i(B4.cursor()) |
2.4 Query
Case 1: the age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6
Case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60
A | |
---|---|
1 | /case 1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6 |
2 | [1,3,6] |
3 | age>=20 && age<=25 && A2.pos(class) |
4 | /case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60 |
5 | j(atr,nt,rate).count((#1==1 && #2==true) || (#1==2 && #3<60))==2 |
6 | /Main table fields |
7 | id,name,gender,age,class |
8 | /Sub table fields |
9 | atr,comment,nt,rate |
10 | /case1_query |
11 | =now() |
12 | =file(“seq_all.ctx”).open() |
13 | =A12.cursor@v(${A7},${A9};${A3}) |
14 | =A13.fetch(5000) |
15 | =interval@ms(A11,now()) |
16 | /case2_query |
17 | =now() |
18 | =file(“seq_all.ctx”).open() |
19 | =A18.cursor@v(${A7},${A9};${A5}) |
20 | =A19.fetch(5000) |
21 | =interval@ms(A17,now()) |
A5 is a conditional expression where j is to join multiple sequence fields into a table sequence, and then count the number of each record that satisfies the condition. If the number is equal to the one in the condition, it means that this record needs to be taken.
III. Use attached table
The data structure in 2.1 can be regarded as the primary-sub table relationship. When two tables are in primary-sub relationship, we can regard the common field part as a base table, and the flexible field part as an attached table. For more information about attached table, visit: SPL Attached Table
3.1 Move the data from database and store
A | |
---|---|
1 | =connect@l(“oracle12c”) |
2 | =A1.cursor(“SELECT * FROM MAIN ORDER BY ID”) |
3 | =A1.cursor@x(“SELECT * FROM SUB ORDER BY ID”) |
4 | =file(“attach_all.ctx”).create(#id,name,gender,age,class) |
5 | =A4.attach(sub,atr,comment,nt,rate) |
6 | =A4.append@i(A2) |
7 | =A5.append@i(A3) |
3.2 Generate test data
The script for generating the attached table data based on the file generated in 2.3:
A | |
---|---|
1 | =file(“seq_all.ctx”).open().cursor(id,name,gender,age,class) |
2 | =file(“seq_all.ctx”).open().cursor(id,atr,comment,nt,rate).news(#2;id,~:atr,comment(#):comment,nt(#):nt,rate(#):rate) |
3 | =file(“attach_all.ctx”).create(#id,name,gender,age,class) |
4 | =A3.attach(sub,atr,comment,nt,rate) |
5 | =A3.append@i(A1) |
6 | =A4.append@i(A2) |
3.3 Query
The query criteria are the same as that described in 2.4.
A | |
---|---|
1 | /case1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6 |
2 | [1,3,6] |
3 | age>=20 && age<=25 && A2.pos(class) |
4 | /case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60 |
5 | sub.count((atr==1 && nt==true) || (atr==2 && rate<60))==2 |
6 | /Main table fields |
7 | id,name,gender,age,class |
8 | /Sub table fields |
9 | atr,comment,nt,rate |
10 | /case1_query |
11 | =now() |
12 | =file(“attach_all.ctx”).open() |
13 | =A12.cursor@v(${A7},sub{${A9}}:sub;${A3}) |
14 | =A13.fetch(5000) |
15 | =interval@ms(A11,now()) |
16 | /case2_query |
17 | =now() |
18 | =file(“attach_all.ctx”).open() |
19 | =A18.cursor@v(${A7},sub{${A9}}:sub) |
20 | =A19.select@v(${A5}) |
21 | =A20.fetch(5000) |
22 | =interval@ms(A17,now()) |
3.4 Search twice
When using the sequence type field, we can utilize the pre-cursor filtering technology to improve performance. However, the conditions on the sub table in attached table require reading all columns first, though some fields are unwanted. If few results are obtained by conditional filtering, it means many useless fields will be read.
Therefore, when using attached table, it is likely to be faster to first take the primary key by the condition-related fields before searching. In contrast, for the sequence type field, it is unnecessary to process this way.
First, find out the primary key sequence of the records that meet conditions, and then search for the results through this sequence. SPL script:
A | |
---|---|
1 | /case1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6 |
2 | [1,3,6] |
3 | age>=20 && age<=25 && A2.pos(class) |
4 | /case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60 |
5 | sub.count((atr==1 && nt==true) || (atr==2 && rate<60))==2 |
6 | /Main table fields |
7 | id,name,gender,age,class |
8 | /Sub table fields |
9 | atr,comment,nt,rate |
10 | /case1_query |
11 | id |
12 | atr,nt,rate |
13 | /keys |
14 | =now() |
15 | =file(“attach_all.ctx”).open() |
16 | =A15.cursor@v(${A11},sub{${A12}}:sub;${A3}) |
17 | =A16.fetch(5000) |
18 | =interval@ms(A14,now()) |
19 | /find by keys |
20 | =now() |
21 | =file(“attach_all.ctx”).open() |
22 | =A21.cursor@v(${A7},sub{${A9}}:sub;A17.find(id)) |
23 | =A22.fetch(5000) |
24 | =interval@ms(A20,now()) |
25 | /case2_query |
26 | =now() |
27 | =file(“attach_all.ctx”).open() |
28 | =A27.cursor@v(${A11},sub{${A12}}:sub) |
29 | =A28.select@v(${A5}) |
30 | =A29.fetch(5000) |
31 | =interval@ms(A26,now()) |
32 | /find by keys |
33 | =now() |
34 | =file(“attach_all.ctx”).open() |
35 | =A34.cursor@v(${A7},sub{${A9}}:sub;A30.find(id)) |
36 | =A35.fetch(5000) |
37 | =interval@ms(A33,now()) |
IV. Use row-based storage
SPL’s columnar storage adopts the data block and compression algorithm, leading to less access data volume and better performance as for the traversing operation. But for scenarios of index-oriented random data fetching, the complexity is much bigger due to the extra decompression process and the fact that each fetching is performed on the whole block. Therefore, the performance of columnar storage would be worse than that of row-based storage in principle.
For the same data, it is impossible to achieve optimal performance in both traversal operation and random data fetching. To obtain ultimate performance, we can store a redundant row-based storage composite table file, and establish an index for the primary keys of this composite table.
The script for converting to row-based storage and creating an index is as follows:
A | |
---|---|
1 | /col2row |
2 | =file(“seq_all.ctx”) |
3 | =file(“seq_all_row.ctx”) |
4 | =A2.reset@r(A3) |
5 | /create_index |
6 | =A3.open().index(idx;id) |
First, find out the primary key in the way of attached table, and then search for the results in the row-based storage index through the primary key. The script is as follows:
A | |
---|---|
1 | /case1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6 |
2 | [1,3,6] |
3 | age>=20 && age<=25 && A2.pos(class) |
4 | /case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60 |
5 | sub.count((atr==1 && nt==true) || (atr==2 && rate<60))==2 |
6 | /Main table fields |
7 | id,name,gender,age,class |
8 | /Sub table fields |
9 | atr,comment,nt,rate |
10 | /case1_query |
11 | id |
12 | atr,nt,rate |
13 | /keys |
14 | =now() |
15 | =file(“attach_all.ctx”).open() |
16 | =A15.cursor@v(${A11},sub{${A12}}:sub;${A3}) |
17 | =A16.fetch(5000).(#1) |
18 | =interval@ms(A14,now()) |
19 | /find by keys |
20 | =now() |
21 | =file(“seq_all_row.ctx”).open() |
22 | =A21.icursor(${A7},${A9};A17.contain(id)) |
23 | =A22.fetch(5000) |
24 | =interval@ms(A20,now()) |
25 | /case2_query |
26 | =now() |
27 | =file(“attach_all.ctx”).open() |
28 | =A27.cursor@v(${A11},sub{${A12}}:sub) |
29 | =A28.select@v(${A5}) |
30 | =A29.fetch(5000).(#1) |
31 | =interval@ms(A26,now()) |
32 | /find by keys |
33 | =now() |
34 | =file(“seq_all_row.ctx”).open() |
35 | =A34.icursor(${A7},${A9};A30.contain(id)) |
36 | =A35.fetch(5000) |
37 | =interval@ms(A33,now()) |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProc_SPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version