"[url]Problem description A flexible data structure means that each record of a data table has di .."

mars RaqForum 25 No.
1 Reply • 193 View • 8 Months ago

SPL practice: query massive and flexible structured data

Problem description

A flexible data structure means that each record of a data table has different data structure. Usually, the fields of data table are divided into two parts, one part is the fields that are common to all records (common field for short), and the other part is the fields that vary from record to record (flexible field for short). There may be as many as hundreds of flexible fields, yet each record may have only a few of them. In complex situations, these flexible fields are further classified.

Here is a student score table, and this table does not involve classification:

Common field				Flexible field
Student ID	Name	Gender	Age	Chinese	English	Math	Physics	Chemistry	Biology
1	Student A	Male	16	67	50	90	100	80	80
2	Student B	Female	18	/	/	/	99	/	70
3	Student C	Male	17	98	88	78	/	/	/
…	…	…	…	…	…	…	…	…	…

Some students do not take all subjects as elective course, so they do not have score for the unelected subject, represented by a slash.

The following is a sports student score table, and this table involves classification:

Common field				Flexible field
Student ID	Name	Gender	Age	Basketball				Football
				Comments	National player or not	Score	Comments		National player or not	Score
1	Student A	Male	16	The student often...	Yes	100	/		/	/
2	Student B	Female	18	/	/	/	The student occasionally...		/	95
3	Student C	Male	17	The student is good at...	Yes	20	The student is skillful at...		Yes	10
…	…	…	…	…	…	…	…		…	…

The common fields contain the basic attribute of each student. The flexible fields are classified into different attributes (basketball and football in this example), and each attribute has the same classes (comments [string], National player or not [boolean] and score [value] in this example).

Now we will focus on the complex situation involving classification. The situation without classification is a special case of complex situation.

Let’s take the following query criteria as an example:

Find out the records whose gender is “male” in the common field (primary table)

and,

find out the records whose class value “national player or not” under the attribute “basketball” is “yes” and whose class value “score” under the attribute “football” is less than 60 in the flexible field (sub table).

In essence, the data structure described in the two examples can be logically regarded as a big wide table. However, when there are too many attributes and classes, this table will have thousands of fields (common fields + all possible attributes * classes). If this table is processed into a physical wide table, it will result in too many fields, most of which are blank.

In general, SQL describes such data structure through the following two tables:

Main table (common field)

Field name	Type	Remarks
id	Value	id number (auto-incrementing)
name	String	Name
gender	Boolean	Gender
age	Value	Age

Sub table (flexible field)

Field name	Type	Remarks
id	Value	id number (same as main table)
atr	Value	Attribute (enumerable, stored as number)
comment	String	Comments (manually enter)
nt	Boolean	National player or not
rate	Value	Score

In this table, the classes refer to the fields comment, nt, and rate. To query the records according to the above-mentioned query criteria, coding in SQL will be:

SELECT *
FROM main m, subs
WHERE m.id= s.id AND
    (SELECT count(*)
    FROM sub ss
WHERE m.id= ss.id AND (ss.atr= 1 AND ss.nt= 1) or (ss.atr=2 AND ss.rate&lt;60) ) = 2

If this query task is handled in SPL, there are usually three methods:

I. Use json-style field

For such flexible data structure, json-style field is usually used.

When no classification is involved, the json string is a record. For example, the scores of student B in the student score table:

{
“Physics”: 99,
“Biology”: 70
}
When involving classification, it is a table sequence. For example, the scores of student C in the sports student score table:
[
{
“Attribute”: basketball,
“Comments”: the student is good at...
“National player or not”: yes
“Score”: 20
},
{
“Attribute”: football, 
“Comments”: the student is skillful at...
“National player or not”: yes
“Score”: 10
}
]

Using the json-style field will increase the storage amount (as each record or table sequence needs to store the field names once), and it always needs to parse json as the record or table sequence before calculation, which will result in poor performance. It is very convenient for SPL to process json-style field, and the corresponding code is relatively conventional, so we won’t go into details here.

Now let’s look at two other methods.

II. Use sequence type field

Let’s take the aforementioned “sports student score table” as an example, using the sequence type field is to regard all attributes of the sub table as a sequence field, and the class is then converted to the corresponding sequence. See the table below for details:

Common field				Flexible field
Student ID	Name	Gender	Age	Attribute	Comments	National player or not	Score
1	Student A	Male	16	[Basketball]	[The student often...]	[Yes]	[100]
2	Student B	Female	18	[Football]	[The student occasionally...]	[null]	[95]
3	Student C	Male	17	[Basketball, football]	[The student is good at..., and skillful at...]	[Yes, Yes]	[20,10]
…	…	…	…	…	…	…	…

2.1 Data structure

Define the test data structure as follows:

Main table (common field)

ield name	Type	Remarks
id	Value	Student ID (auto-incrementing)
name	String	Name
gender	Boolean	Gender
age	Value	Age
class	Value	Class (enumerable, stored as number)

Sub table (flexible fields)

Field name	Type	Remarks
id	Value	Student ID (same as main table)
atr	Value	Attribute (enumerable, stored as number)
comment	String	Comments (manually enter)
nt	Boolean	National player or not
rate	Value	Score

2.2 Move the data from database and store

The code for taking the data from database:

	A
1	=connect@l(“oracle12c”)
2	=A1.cursor(“SELECT * FROM MAIN ORDER BY ID”)
3	=A1.cursor@x(“SELECT * FROM SUB ORDER BY ID”)
4	=A3.group(id;~.(atr):atr,~.(comment): comment,~.(nt): nt,~.(rate): rate)
5	=joinx(A2:main,id;A4:sub,id)
6	=A5.new(main.id,main.name,main.gender,main.age,main.class,sub.atr,sub.comment,sub.nt,sub.rate)
7	=file(“seq_all.ctx”).create(#id,name,gender,age,class,atr,comment,nt,rate)
8	>A7.append@i(A6)

2.3 Generate test data

For the convenience of testing, we write the following SPL code to directly generate a composite table file seq_all.ctx:

	A	B
1	>rand@s(1)
2	>n=100000
3	=file(“seq_all.ctx”).create(#id,name,gender,age,class,atr,comment,nt,rate)
4	for 30	=to(n(A4-1)+1,nA4).new(~:id,“name”/~:name,rand(2)+1:gender,rand(17)+18:age,rand(10)+1:class,20.sort(rand()).to(rand(5)+5):atr,atr.len().(rands(“qwertyuiopasdfghjklzxcvbnm”,rand(200)+2)):comment,atr.len().([true,false,null](rand(3)+1)):nt,atr.len().(rand(100)+1):rate)
5		>A3.append@i(B4.cursor())

2.4 Query

Case 1: the age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6

Case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60

	A
1	/case 1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6
2	[1,3,6]
3	age>=20 && age<=25 && A2.pos(class)
4	/case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60
5	j(atr,nt,rate).count((#1==1 && #2==true) \|\| (#1==2 && #3<60))==2
6	/Main table fields
7	id,name,gender,age,class
8	/Sub table fields
9	atr,comment,nt,rate
10	/case1_query
11	=now()
12	=file(“seq_all.ctx”).open()
13	=A12.cursor@v(${A7},${A9};${A3})
14	=A13.fetch(5000)
15	=interval@ms(A11,now())
16	/case2_query
17	=now()
18	=file(“seq_all.ctx”).open()
19	=A18.cursor@v(${A7},${A9};${A5})
20	=A19.fetch(5000)
21	=interval@ms(A17,now())

A5 is a conditional expression where j is to join multiple sequence fields into a table sequence, and then count the number of each record that satisfies the condition. If the number is equal to the one in the condition, it means that this record needs to be taken.

III. Use attached table

The data structure in 2.1 can be regarded as the primary-sub table relationship. When two tables are in primary-sub relationship, we can regard the common field part as a base table, and the flexible field part as an attached table. For more information about attached table, visit: SPL Attached Table

3.1 Move the data from database and store

	A
1	=connect@l(“oracle12c”)
2	=A1.cursor(“SELECT * FROM MAIN ORDER BY ID”)
3	=A1.cursor@x(“SELECT * FROM SUB ORDER BY ID”)
4	=file(“attach_all.ctx”).create(#id,name,gender,age,class)
5	=A4.attach(sub,atr,comment,nt,rate)
6	=A4.append@i(A2)
7	=A5.append@i(A3)

3.2 Generate test data

The script for generating the attached table data based on the file generated in 2.3:

	A
1	=file(“seq_all.ctx”).open().cursor(id,name,gender,age,class)
2	=file(“seq_all.ctx”).open().cursor(id,atr,comment,nt,rate).news(#2;id,~:atr,comment(#):comment,nt(#):nt,rate(#):rate)
3	=file(“attach_all.ctx”).create(#id,name,gender,age,class)
4	=A3.attach(sub,atr,comment,nt,rate)
5	=A3.append@i(A1)
6	=A4.append@i(A2)

3.3 Query

The query criteria are the same as that described in 2.4.

	A
1	/case1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6
2	[1,3,6]
3	age>=20 && age<=25 && A2.pos(class)
4	/case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60
5	sub.count((atr==1 && nt==true) \|\| (atr==2 && rate<60))==2
6	/Main table fields
7	id,name,gender,age,class
8	/Sub table fields
9	atr,comment,nt,rate
10	/case1_query
11	=now()
12	=file(“attach_all.ctx”).open()
13	=A12.cursor@v(${A7},sub{${A9}}:sub;${A3})
14	=A13.fetch(5000)
15	=interval@ms(A11,now())
16	/case2_query
17	=now()
18	=file(“attach_all.ctx”).open()
19	=A18.cursor@v(${A7},sub{${A9}}:sub)
20	=A19.select@v(${A5})
21	=A20.fetch(5000)
22	=interval@ms(A17,now())

3.4 Search twice

When using the sequence type field, we can utilize the pre-cursor filtering technology to improve performance. However, the conditions on the sub table in attached table require reading all columns first, though some fields are unwanted. If few results are obtained by conditional filtering, it means many useless fields will be read.

Therefore, when using attached table, it is likely to be faster to first take the primary key by the condition-related fields before searching. In contrast, for the sequence type field, it is unnecessary to process this way.

First, find out the primary key sequence of the records that meet conditions, and then search for the results through this sequence. SPL script:

	A
1	/case1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6
2	[1,3,6]
3	age>=20 && age<=25 && A2.pos(class)
4	/case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60
5	sub.count((atr==1 && nt==true) \|\| (atr==2 && rate<60))==2
6	/Main table fields
7	id,name,gender,age,class
8	/Sub table fields
9	atr,comment,nt,rate
10	/case1_query
11	id
12	atr,nt,rate
13	/keys
14	=now()
15	=file(“attach_all.ctx”).open()
16	=A15.cursor@v(${A11},sub{${A12}}:sub;${A3})
17	=A16.fetch(5000)
18	=interval@ms(A14,now())
19	/find by keys
20	=now()
21	=file(“attach_all.ctx”).open()
22	=A21.cursor@v(${A7},sub{${A9}}:sub;A17.find(id))
23	=A22.fetch(5000)
24	=interval@ms(A20,now())
25	/case2_query
26	=now()
27	=file(“attach_all.ctx”).open()
28	=A27.cursor@v(${A11},sub{${A12}}:sub)
29	=A28.select@v(${A5})
30	=A29.fetch(5000)
31	=interval@ms(A26,now())
32	/find by keys
33	=now()
34	=file(“attach_all.ctx”).open()
35	=A34.cursor@v(${A7},sub{${A9}}:sub;A30.find(id))
36	=A35.fetch(5000)
37	=interval@ms(A33,now())

IV. Use row-based storage

SPL’s columnar storage adopts the data block and compression algorithm, leading to less access data volume and better performance as for the traversing operation. But for scenarios of index-oriented random data fetching, the complexity is much bigger due to the extra decompression process and the fact that each fetching is performed on the whole block. Therefore, the performance of columnar storage would be worse than that of row-based storage in principle.

For the same data, it is impossible to achieve optimal performance in both traversal operation and random data fetching. To obtain ultimate performance, we can store a redundant row-based storage composite table file, and establish an index for the primary keys of this composite table.

The script for converting to row-based storage and creating an index is as follows:

	A
1	/col2row
2	=file(“seq_all.ctx”)
3	=file(“seq_all_row.ctx”)
4	=A2.reset@r(A3)
5	/create_index
6	=A3.open().index(idx;id)

First, find out the primary key in the way of attached table, and then search for the results in the row-based storage index through the primary key. The script is as follows:

	A
1	/case1: age is greater than or equal to 20 and less than or equal to 25, and the classes include 1, 3, 6
2	[1,3,6]
3	age>=20 && age<=25 && A2.pos(class)
4	/case 2: the class value “national player or not” under the attribute “basketball” is “yes”, and the class value “score” under the attribute “football” is less than 60
5	sub.count((atr==1 && nt==true) \|\| (atr==2 && rate<60))==2
6	/Main table fields
7	id,name,gender,age,class
8	/Sub table fields
9	atr,comment,nt,rate
10	/case1_query
11	id
12	atr,nt,rate
13	/keys
14	=now()
15	=file(“attach_all.ctx”).open()
16	=A15.cursor@v(${A11},sub{${A12}}:sub;${A3})
17	=A16.fetch(5000).(#1)
18	=interval@ms(A14,now())
19	/find by keys
20	=now()
21	=file(“seq_all_row.ctx”).open()
22	=A21.icursor(${A7},${A9};A17.contain(id))
23	=A22.fetch(5000)
24	=interval@ms(A20,now())
25	/case2_query
26	=now()
27	=file(“attach_all.ctx”).open()
28	=A27.cursor@v(${A11},sub{${A12}}:sub)
29	=A28.select@v(${A5})
30	=A29.fetch(5000).(#1)
31	=interval@ms(A26,now())
32	/find by keys
33	=now()
34	=file(“seq_all_row.ctx”).open()
35	=A34.icursor(${A7},${A9};A30.contain(id))
36	=A35.fetch(5000)
37	=interval@ms(A33,now())

SPL Official Website 👉 http://www.scudata.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProc

SPL Learning Material 👉 http://c.scudata.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/ydhVnFH9

Youtube 👉 https://www.youtube.com/@esProc_SPL

esProc

mars • 193 View • 8 Months ago