SPL Practice: integerization during data dump
Using SPL for performance optimization, such as converting data types like string to integer during data dump, can reduce storage space and improve computing performance. This article will present how to implement integerization through a practical example.
Problem description
The following table is the data structure of a certain space-time collision problem:
Field name | Field type | Field meaning | Remarks | Sample data |
---|---|---|---|---|
no | String | Object flag | Unique flag of object | 100000000009 |
ct | Int | Timestamp | Unix timestamp (seconds) | 1690819200 |
lac | String | Space flag 1 | 40000 | |
ci | String | Space flag 2 | 66000000 |
After understanding the business, we know that all the values of ‘no’ are composed of numbers and can be directly converted to Long integer; ‘lac’ and ‘ci’ always appear in pairs and can be merged as one field called ‘loc’, and the count range of ‘loc’ after de-duplication is approximately 270,000, making it possible to use ‘Int’ integer.
Since the field ‘no’ only involves comparison in subsequent calculations and does not participate in positioning calculation, it does not need to be sequence-numberized, and we just need to convert its value to integer. On the contrary, the field ‘loc’ will involve positioning calculation in subsequent calculations, so it needs to be sequence-numberized, that is, converting its values to natural numbers starting from 1.
Data integerization and dump
The data structure after dumping is as follows:
Field name | Data type | Field meaning | Remarks | Sample data |
---|---|---|---|---|
no | Long | Object flag | Unique flag of object | 100000000009 |
ct | Int | Timestamp | Unix timestamp (seconds) | 1690819200 |
loc | Int | Space flag | Sequence number of space flag | 10282 |
Compared with the original data structure, the following two changes are made during data dump:
1. merge the ‘lac’ and ‘ci’ fields as one ‘loc’ field and convert their values to Int sequence number. The original ‘lac’ and ‘ci’ fields, as dimension table, are stored separately;
2. convert the data type (number string) of the ‘no’ field to Long integer.
SPL code:
A | |
---|---|
1 | =connect@l("oracle12c") |
2 | =loc_list=A1.query("SELECT DISTINCT lac,ci FROM testdata").keys@i(lac,ci) |
3 | =file("loc_list.btx").export@b(A2) |
4 | =A1.cursor@x("SELECT no,ct,lac,ci FROM testdata").new(long(no):no,int(ct):ct,loc_list.pfind(lac,ci):loc) |
5 | =file("tmp.btx").export@b(A4) |
6 | =file("tmp.btx").cursor@b().sortx(no,ct) |
7 | =file("1.day.ctx").create@y(#no,#ct,loc) |
8 | >A7.append@i(A6) |
9 | =movefile(file("tmp.btx")) |
A1: connect to the database; the @l option represents that the returned field name and table name are in lowercase;
A2: fetch the de-duplicated space flags from the database and create an index;
A3: save the dimension table information with space flags to a bin file;
A4: while fetching data from the database, convert the values of ‘no’ to long integer, and utilize the index to find out the sequence number corresponding to ‘loc’ and ‘ci’ from loc_list;
A5 - A9: save the ordered results of ‘no’ and ‘ct’ to a composite table.
Restore through sequence number
Since the dumped fact table loses the information of original ‘lac’ and ‘ci’ fields, we need to restore such information by utilizing the above-mentioned bin file saved when dumping. For example, the following code is to restore the ‘lac’ and ‘ci’ values of a certain calculation result - table sequence T(no, ct, loc):
A | |
---|---|
1 | =file("loc_list.btx").import@b() |
2 | =T.new(string(no):no,ct,A1(loc).lac,A1(loc).ci) |
A1: read the dimension table;
A2: since the value of loc in T correspond to the sequence number of record of dimension table in A1, we can directly take the field corresponding to A1(loc).
New data processing
The data will continue to increase, and the space flag on new data may also be new, so we need to append the data with new space flag to the original dimension table. SPL code:
A | |
---|---|
1 | =file("loc_list.btx").import@b().keys@i(lac,ci) |
2 | =connect@l("oracle12c") |
3 | =A2.query@x("SELECT DISTINCT lac,ci FROM testdata where ct >="/(long(date(now()))\1000)) |
4 | =A3.select(!A1.find(lac,ci)) |
5 | =file("loc_list.btx").export@ab(A4) |
In this way, when dumping new data for a day, we just need to store new data into the dimension table bin file ‘loc_list.btx’ where the data has been appended, which will not affect the corresponding relationship of original space flag sequence number. SPL code:
A | |
---|---|
1 | =connect@l("oracle12c") |
2 | =loc_list=file("loc_list.btx").import@b().keys@i(lac,ci) |
3 | =A1.cursor@x("SELECT no,ct,lac,ci FROM testdata where ct >="/(long(date(now()))\1000)).new(long(no):no,int(ct):ct,loc_list.pfind(lac,ci):loc) |
4 | =file("tmp.btx").export@b(A3) |
5 | =file("tmp.btx").cursor@b().sortx(no,ct) |
6 | =file("2.day.ctx").create@y(#no,#ct,loc) |
7 | >A6.append@i(A5) |
8 | =movefile(file("tmp.btx")) |
SPL Official Website 👉 https://www.scudata.com
SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL
SPL Learning Material 👉 https://c.scudata.com
SPL Source Code and Package 👉 https://github.com/SPLWare/esProc
Discord 👉 https://discord.gg/cFTcUNs7
Youtube 👉 https://www.youtube.com/@esProc_SPL
Chinese version