"When the amount of data in a single database is too large to affect the performance, the data ca .."

Hoo RaqForum 19 No.
1 Reply • 663 View • 4 Years ago

Using esProc to realize the sub database summary

When the amount of data in a single database is too large to affect the performance, the data can be split into multiple servers, each server only bears part of the calculation pressure, and then the SPL merges the calculation results. In particular, data can be divided into historical database and current real-time database, and T + 0 calculation can be realized by SPL. Here are a few typical examples to illustrate the use of sub database summary.

Filter

The order table, orders, is split and stored in two Oracle databases. The data source names are orcla and orclb respectively. Please filter out orders with amount greater than or equal to 10000.

SPL code is as follows：

	A	B	C
1	=[connect("orclA"),connect("orclB")]		/ Connect multiple data sources
2	select * from orders where amount>=10000		/SQL filter
3	fork A1	=A3.query(A2)	/ Parallel computing
4	=A3.conj()		/ Merge result

Sort

Please filter out orders with an amount greater than or equal to 10000, and sort by order of order amount.

In the sorting algorithm of the sub database, the calculation results of each thread cannot be simply merged, but merge function is used for merging. The SPL code is as follows:

6	select * from orders where amount>=10000 order by amount		/SQL sorting
7	fork A1	=A7.query(A2)
8	=A7.merge(AMOUNT)		/Merge

Group and aggregation

Please group the order table by year and month and sum the amount field of each group of data.

In the sub database group aggregation algorithm, the combined data should be grouped and aggregated again. The SPL code is as follows:

10	select extract(year from orderTime)y,extract(month from orderTime)m,sum(amount) amount from orders group by extract(year from orderTime),extract(month from orderTime)		/SQL group and aggregation
11	fork A1	=A11.query(A10)	/Group and aggregation in each sub database
12	=A11.conj()		/Merge
13	=A12.groups(Y,M;sum(AMOUNT):AMOUNT)		/Group and aggregate again

If the number of groups is large, the order should be used to improve the performance. The specific method is: in SQL, sort by group fields in advance to make the SQL results orderly, then use the merge algorithm to merge the data, and use the group@O to group and aggregate. The SPL code is as follows:

14	select extract(year from orderTime)y,extract(month from orderTime)m,sum(amount) amount from orders group by extract(year from orderTime),extract(month from orderTime) order by y,m		/SQL group and aggregate
15	fork A1	=A15.query(A14)
16	=A15.merge(Y,M)		/Merge
17	=A16.groups@o(Y,M;sum(AMOUNT):AMOUNT)		/Orderly group and aggregate

Filter after group aggregation

Please group the order table by year and month, sum the amount field of each group of data, and then filter out the results with the summary value greater than 110000000.

SQL implements this algorithm, usually with a having statement after group by, that is:

select extract(year from orderTime)y,extract(month from orderTime)m,sum(amount) amt from orders group by extract(year from orderTime),extract(month from orderTime) having sum(amount)>=110000000

When implementing the algorithm in the sub database, the above SQL cannot be used directly. The group aggregation algorithm should be implemented first, and then use SPL to implement filtering. The code is as follows:

19	select extract(year from orderTime)y,extract(month from orderTime)m,sum(amount) amt from orders group by extract(year from orderTime),extract(month from orderTime)		/SQL without filtering
20	fork A1	=A20.query(A19)	/Group aggregation in each sub database
21	=A20.conj()		/Merge
22	=A21.groups(Y,M;sum(AMT):AMT)		/Group aggregation again
23	=A22.select(AMT>=110000000)		/Filter
24	=A1.(~.close())		/Close the connections

Sub databases with different structure

Apart from sub databases with the same structure, SPL also supports sub databases with different structure. It should be noted that the SQL syntax of different databases is not universal, such as the number truncation function, which is written as trunc in Oracle and truncate in MySQL. In order to deal with different syntax correctly, we should first write SPL standard SQL, and then translate it into different local SQL by sqltranslate function of SPL.

For example, the orders table is stored in Oracle and mysql. The data source names are orcl and my. Please query the records with the amount field greater than or equal to 10000, and truncate the amount field to round.

SPL code is as follows：

	A	B	C
1	=[[connect("orcl"),"ORACLE"],[connect("my"),"MYSQL"]]		/ Connect data source, mark database type
2	select ORDERID,ORDERTIME,truncate(AMOUNT,0),CLIENTID,SALESID from orders where amount>=10000		/Standard SQL
3	fork A1	=A2.sqltranslate(A3(2))	/ Convert to local SQL
4		=A3.query(B3)	/Query
5	=A3.conj()		/ Merge result

Join

In addition to single table in split databases calculation, SPL also supports multi table join in split databases calculation. In this case, the fact table and dimension table should be handled separately. The fact table should be split and stored in separate databases, and each database stores part of the data. The dimension table should not be split, and should be copied to each database in full.

For example, the sales table is a dimension table of the orders table. The two tables take the salesid as the join field. Please group according to the dept field of the sales table to calculate the sales of each department. Assuming that the sales table has been fully stored in each database, the SPL code is as follows:

	A	B	C
1	=[connect("orclA"),connect("orclB")]		/ Connect multiple data sources
2	select sales.dept,sum(orders.amount)amount from orders,sales where orders.salesID=sales.salesID group by sales.dept		/SQL group aggregation
3	fork A1	=A3.query(A2)	/ Parallel computing
4	=A3.conj()		/Merge results
5	=A4.groups(DEPT;sum(AMOUNT):AMOUNT)		/Group aggregation again
6	=A3.(~.close())		/Close connections