Multidimensional Analysis Backend Practice 6: big fact table associate with small dimension table

 

Abstract

This series of articles explain how to perform multidimensional analysis (OLAP) in steps with detailed samples. Click to learn more in Multidimensional Analysis Backend Practice 6: big fact table associate with small dimension table. Multidimensional Analysis Backend Practice 6: big fact table associate with small dimension table

Aim of practice

This issue aims to associate all the new customer data in the customer wide table with multiple dimension tables such as the branch department table, and perform slice and grouping calculation with dimension table fields on the basis of last issue.

 

The steps of practice are:

1. Pre-process the customer table and perform the sequence numberization of the dimensions.

2. Pre-calculate the dimension tables and perform the association of each dimension.

3. Associate the customer data in wide table and the dimension tables, then perform slice and grouping: implemented by calling SPL with Java code.

 

The associations between the customer table and each dimension table are shown in the picture:

undefined 

The dimension tables include:

1. The branch department table “department”

2. The location table “location”

3. The country table “countries”

4. The region table “regions”

5. The VIP table “vip”

6. The credit grading table “credit_grade”

The dimension table fields are the same as those in last issue.

 

The purpose of multidimensional analysis calculation can be described in the following SQL statement of Oracle:

select v.interest_rate1,ct.country_id,c.job_id,sum(c.balance) sum,count(c.customer_id) count
from customer c
 left join department d on c.department_id=d.department_id
 left join locations l on d.location_id=l.location_id
 left join countries ct on l.country_id=ct.country_id
 left join regions r on r.region_id=ct.region_id
 left join vip v on c.vip_id=v.vip_id
 left join credit_grade cg on c.credit_grade=cg.credit_grade
where d.applause_rate in (26,30,33,65,67,74,75,77)
and r.region_name in ('Asia','Europe','Americas')
and v.charge_ratio in (9,11)
and cg.quota2 in(50000,80000)
and c.job_id in ('AD_VP','FI_MGR','AC_MGR','SA_MAN','SA_REP')
and c.flag1='1' and c.flag8='1'
group by v.interest_rate1,ct.country_id,c.job_id

Sequence numberization of dimension

The data of branch department table are as follows:

undefined 

What stores in the department_id field of the customer table is a manual code (as shown in the following picture) which needs to be converted to sequence numbers in the branch department table.

undefined 

We can append conversion code in etl.dfx.


A

1

=file("data/customer.ctx").open().cursor()

2

=file("data/department.btx").import@b().keys(department_id).derive(#:num)

3

=A1.switch(department_id,A2:department_id)

4

=A3.new(department_id.num:department_id,job_num,employee_id,begin_date,customer_id,first_name,last_name,phone_number,job_title,balance,department_name,flag1,flag2,flag3,flag4,flag5,flag6,flag7,flag8,vip_id,credit_grade)

5

=file("data/customerDept.ctx").create@y(#department_id,#job_num,#employee_id,#begin_date,#customer_id,first_name,last_name,phone_number,job_title,balance,department_name,flag1,flag2,flag3,flag4,flag5,flag6,flag7,flag8,vip_id,credit_grade).append(A4)

A1: open the cursor based on customer composite table.

A2: read the data of branch department table, use department_id as the primary key, and add calculation on “num” column to save the sequence number.

A3: use the customer composite table cursor to associate A2.

A4: replace “department_id” with “num” in the cursor.

A5: export the converted cursor to customerDept.ctx, as shown in the following picture:

 

Pre-associate dimension tables

Modify init.dfx to load the dimension tables in memory and associate them. The code is as follows:


A

B

1

=file("data/job.btx").import@ib()

>env(job,A1)

2

=file("data/department.btx").import@b().keys(department_id)

=file("data/vip.btx").import@b().keys(vip_id)

3

=file("data/credit_grade.btx").import@b().keys(credit_grade)

=file("data/locations.btx").import@b().keys(location_id)

4

=file("data/countries.btx").import@b().keys(country_id)

=file("data/regions.btx").import@b().keys(region_id)

5

=A2.switch(location_id,B3.switch(country_id,A4.switch(region_id,B4)))

6

=env(department,A5)

=env(vip,B2)

7

=env(credit_grades,A3)


A1: retrieve the data of job from the bin file, and @i is used to read them as a sequence when there is only one column.

B1: save as global variable “job”.

A2-B4: retrieve the data of branch department, vip, credit grading, location, country, and region from the bin file respectively, and create primary keys.

A5: associate the “department”, “locations”, “countries” and “regions” tables according to their layers.

A6: save as global variable “department”. B6: save as global variable “vip”.

A7: save as global variable “credit_grades”.

 

The written init.dfx should be put in the main path of node, and it can be called automatically when the node is started or restarted.

 

Association calculation

Modify the olap.dfx and customer.dfx in previous issues.

 

The given parameters are still: the value of table name arg_table is customer, and other parameters are in json format. The arg_json smaple is as follows:

{
       aggregate:
              [
                     {
                            func:"sum",
                            field:"balance",
                            alias:"sum"
                     },
                     {
                            func:"count",
                            field:"customer_id",
                            alias:"count"
                     }
              ],
       group:
              [
                     "vip_id.interest_rate1:vip_id_interest_rate1",
               "department_id.location_id.country_id.country_id:department_id_location_id_country_id",
                     "job_id"
              ],
       slice:
              [
                     {
                            dim:"department.applause_rate",
                            interval:null,
                            value:[26,30,33,65,67,74,75,77]
                     },
 
                     {
                      dim:"department.location_id.country_id.region_id.region_name",
                            value:["Asia","Europe","Americas"]
                     },
                     {
                            dim:"vip.charge_ratio",
                            value:[9,11]
                     },
                     {
                            dim:"credit_grades.quota2",
                            value:[50000,80000]
                     },
                     {
                            dim:"job_id",
                      value:["AD_VP","FI_MGR","AC_MGR","SA_MAN","SA_REP"]
                     },
                     {
                            dim:"flag1",
                            value:"1"
                     },
                     {
                            dim:"flag8",
                            value:"1"
                     }
              ]
}

There are department_detail.location_id, department_detail.applause_rate, etc. in “group” and “slice”. We should first convert the filtering conditions to the field conditions of the customer table itself, performing filtering before cursor creating, and then associate the customer table and the dimension tables when performing multidimensional analysis calculation.

 

The colon in department_detail.location_id.country_id.country_id:department_detail_location_id_country_id is used to give an alias for the attribute-lized foreign key field, which is used as the field name of the final result.

 

Step one: modify customer.dfx.

 

We should first modify customer.dfx before rewriting olap.dfx. Change the way we handle with arg_json parameter and the SPL code is:


A

B

C

D

1

Func




2


if A1.value==null

return   "between("/A1.dim/","/A1.interval(1)/":"/A1.interval(2)/")"

3


else if ifa(A1.value)

return   string(A1.value)/".contain("/A1.dim/")"

4


else if   ifstring(A1.value)

return A1.dim/"==\""/A1.value/"\""

5


else 

return   A1.dim/"=="/A1.value

6

func




7


=[]

=A6.select(like(dim,B6/".*"))

8


for C7

=[func(A1,B8)]


9



>B7|=C8


10


=${B6}.pselect@a(${B7.concat("  &&")})

=A6\C7


11


=C10.create().record([C6,null,B10])


12


return B11|C10



13

=json(arg_json)

=date("2000-01-01")


14

=A13.aggregate.(~.func/"("/~.field/"):"/~.alias).concat@c()


15

=A13.group.(if(~=="job_id","job_num",~))


16

=A15.(if(~=="begin_yearmonth","begin_date\\100:begin_yearmonth",~)).concat@c()

17

=A13.aggregate.(field)

=A15.(if(~=="begin_yearmonth","begin_date",~))

=(A17|B17).id()


18

=[]




19

for C17

if   like(A19,"department_id.*")

>C17(#A19)="department_id"

>A18|=["department_id,department:#"]

20


if   like(A19,"vip_id.*")

>C17(#A19)="vip_id"

>A18|=["vip_id,vip:#"]

21


if   like(A19,"credit_grade.*")

>C17(#A19)="credit_grade"

>A18|=["credit_grade,credit_grade:#"]

22

=C17.id().concat@c()



23

=func(A6,A13.slice,"vip","vip_id")

=func(A6,A23,"credit_grades","credit_grade")

=func(A6,B23,"department","department_id")

24

=[]




25

for C23

if A25.dim=="begin_date"   && A25.value!=null

>A25.value=int(interval@m(C13,eval(A25.value))*100+day(eval(A25.value)))

26


else if   A25.dim=="begin_date" && A25.value==null

=A25.interval.(~=int(interval@m(C13,eval(~))*100+day(eval(~))))

27


else if   A25.dim=="job_id"

>A25.dim="job_num"


28



>A25.value=A25.value.(job.pos@b(~))

29


else if like(A25.dim,   "flag?")

>A25.value=int(A25.value)


30


=[func(A1,A25)]

>A24|=B30


31

=A13.group.(~.split(":")).(~=~.m(-1))|A13.aggregate.(alias)

=A31(A31.pos("job_id"))="job(job_num):job_id"

32

=A31(A31.pos("begin_yearmonth"))="month@y(elapse@m(date(\""/C13/"\"),begin_yearmonth)):begin_yearmonth"

33

=A31(A31.pos("begin_date"))="elapse@m(B4,begin_date\\100)+(begin_date%100-1):begin_date"

34

return   A22,A14,A16,A24.concat("&&  "),A31.concat@c(),A18.concat(";")

 

Area from A1 to C5 is subprogram which only executes when being called. The code is exactly the same as those in previous issues.

 

Area from A6 to C12 is a new added subprogram which will be introduced later when being called.

 

A12: parse arg_json to a table sequence and the return result is a nested multi-layer table sequence as follows:

 

undefined 

The “aggregate” in it is:

undefined 

The “group” is:

undefined 

The “slice” is:

undefined 

A14: first calculate “aggregate” to colon-concatenated string sequences and then concatenate the sequences to a string with comma: sum(balance):sum,count(customer_id):count, i.e., the aggregate expression.

 

A15: replace the job_id in “group” with job_num.

 

A16: convert the begin_yearmonth in “group” to integers.

 

A17-C17: calculate the field names to be read from the composite table. And the result is:

undefined 

Those fields in associated dimension tables need to be converted to fields of the composite table itself.

 

A18: define an empty sequence which prepares to store the dimension tables and fields to be associated.

 

A19: loop through C17, and the loop body is from B19 to C21.

 

B19: if A19 contains department_id.*, then it needs to associate with branch department table sequence.

 

C19: assign the corresponding member of C17 as department_id.

 

D19: add a member "department_id,department:#" in A18. the department_idof the customer table has already been sequence numberized, so we can directly write #.

 

B20-D20: process the vipdimension table in the same way.

 

B21-D21: process the credit_gradedimension table in the same way.

 

When the loop of A19 is over, the result of A18 is as follows:

undefined 

A22: de-duplicate the processed C17, and concatenate the result to a string with commas: "balance,customer_id,department_id,job_num,vip_id".

 

A23: use A13.slice, dimension table vipand dimension table field vip_idas the given parameters to call the A6 subprogram.

 

The parameters are stored in A6, B6 and C6 respectively. And A6 is:

undefined 

 

Here the third member of A6 is vip.charge_ratio, and the purpose of subprogram is to process it as the filtering condition for “vip_id”, the field of the composite table itself.

 

B7: define an empty sequence which prepares to store the filtering conditions related to vipas follows:

undefined 

B8: loop through C7, and the loop body is from C8 to C9.

 

C8: call A1, and calculate B8 as the esProc expression:

undefined 

C9: union C8 in B7:

undefined 

B10: use all the conditions in B7 to filter the dimension table vip, and obtain the sequence number of vip:

undefined 

C10: remove C7 from A6:

undefined 

B11: create a new record in C10:

undefined 

B12: union B11 and the return result of C10, and end the A6 subprogram.

A23: after the subprogram returns, the filtering conditions related to “vip” in the A23 sequence are converted to the conditions in the vip_id field of the customer table as follows:

undefined 

B23: convert the “credit_grade.*” conditions in A23 to the conditions in the credit_grade field of the customer table in the same way.

 

C23: convert the “department.*” conditions in A23 to the conditions in the department_id field of the customer table in the same way.

 

A24: define an empty sequence which prepares to store the slice (filtering conditions) expressions.

 

A25: loop through sliceand the loop body is the area from B25 to C30 in which from B25 to C29 is the optimization conversion to the valueor intervalof slice.

 

B25: if the “dim” of A25 is begin_date and the “value” is not null, that is, the begin_date is a single date. In this case, C25 converts that single date to an integer and assigns it to “value”.

 

B26: if the “dim” of A25 is begin_date and the “value” is null, that is, the begin_date is within a range of dates. In this case, C25 converts the beginning and ending dates to integers and assigns them to “interval”.

 

B27: if the dimof A10 is job_id, that is, job_id is an enumeration value, for example, ["AD_VP","FI_MGR","AC_MGR","SA_MAN","SA_REP"].contain(job_id). In this case, C27 converts the dimof A10 to job_num and C28 converts the enumeration value valueof A25 to the position in the sequence of global variable job, i.e., job_num integer sequence. For example: [5,7,2,15,16].

 

B29: if the dimof A25 is flag1, flag2, ..., flag8, that is, the flag bit equals to 1or 0. In this case, C29 converts the valueof A25 from strings to integers.

 

B30: take the result of B25 to C29, which converts the “value” or “interval” of “slice” for performance optimization, as parameter to call subprogram A1.

 

The subprogram A1 (from B2 to B5) is identical to the func code in the first customer.dfx, which will not be elaborated here.

 

C30: append the return result of func A1 to A24. Continue the loop in A25 until it is over. And the sequence of slice expressions is well prepared.

A31: prepare the conversion expression of result set display value. Concatenate A13.group.(~.split(":")).(~=~.m(-1)) with A13.aggregate.alias sequence as follows:

undefined 

A13.group.(~.split(":")).(~=~.m(-1)) is used to take the alias after the colon if there is one in the group, otherwise the value of groupis directly taken.

 

C31: replace the job_id in A31 with conversion statement which is used to convert the job_num in result set to job_id.

 

A32-A33: generate expressions to convert the begin_date and begin_yearmonth in the result from integers to dates.

 

A34: return each expression string respectively:

 

The required fields: balance, customer_id, department_id, job_num, vip_id

 

The aggregate expression: sum(balance):sum,count(customer_id):count

 

The grouping expression;

vip_id.interest_rate1:vip_id_interest_rate1,department_id.location_id.country_id.country_id:department_id_location_id_country_id,job_num

The slice expression:

[2].contain(department_id)

&& [5,8].contain(credit_grade)

&& [2].contain(vip_id)

&& [5,7,2,15,16].contain(job_num)

&& flag1==1 && flag8==1

The result set display value conversion expression:

vip_id_interest_rate1,department_id_location_id_country_id,job(job_num):job_id,sum,count

 

The foreign key associate expression:

department_id,department:#;vip_id,vip:#

 

Step two: modify olap.dfx.

 

The cellset parameters are arg_table (whose value is customer) and arg_json. The SPL code is as follows:


A

1

=call(arg_table/".dfx",arg_json)

2

=file("data/"/arg_table/"Dept.ctx").open()   

3

=A2.cursor@m(${A1(1)};${A1(4)};2)

4

=A3.switch(${A1(6)})


=A4.groups(${A1(3)};${A1(2)})


=A5.new(${A1(5)})


return A6

 

A1: call customer.dfx based on the value of parameter arg_table, the calling parameter is arg_json, and the return value is each expression string returned by customer.dfx.

 

A2: open the composite table based on the arg_table parameter.

 

A3: create a cursor on A2 with filtering conditions, and the statement actually executed is:

= A2.cursor@m(balance,customer_id,department_id,job_num,vip_id; [4,5,6,12,16,18,22,23,24,27].contain(department_id) && [5,8].contain(credit_grade) && [2].contain(vip_id) && [5,7,2,15,16].contain(job_num) && flag1==1 && flag8==1;2)

A4: associate the cursor with the dimension tables in memory, and the statement actually executed is:

=A3.switch(department_id,department:#;vip_id,vip:#)

 

A5: perform small result set grouping on the associated cursor. And the statement actually executed is:

=A4.groups(vip_id.interest_rate1:vip_id_interest_rate1,department_id.location_id.country_id.country_id:department_id_location_id_country_id,job_num;sum(balance):sum,count(customer_id):count), and the execution result is:

undefined 

A6: convert the grouping results to display values, and the statement actually executed is:

=A5.new(vip_id_interest_rate1,department_id_location_id_country_id,job(job_num):job_id,sum,count), and the result is:

undefined 

A7: return the calculation result.

 

Step three: call in Java code

After being written, olapMem.dfx can be called as a stored procedure during multidimensional analysis, and most of the Java code is the same as those in previous issues with differences in the name of dfx file, and the values of arg_table and arg_json. The Java code is:

public void testOlapServer(){

       Connection con = null;

       java.sql.PreparedStatement st;

       try{

              // create a connection

              Class.forName("com.esproc.jdbc.InternalDriver");

              // get the connection based on url

              con= DriverManager.getConnection("jdbc:esproc:local://?onlyServer=true&sqlfirst=plus");

// call the stored procedure, and olap is the file name of dfx

st =con.prepareCall("call olap(?,?)");

st.setObject(1, "newCustomer");

st.setObject(2, "{aggregate:\[{func:\\"sum\\",field:\\"balance\\",alias:\\"sum\\"},{func:\\"count\\",field:\\"customer\_id\\",alias:\\"count\\"}\],group:\[\\"vip\_id.interest\_rate1:vip\_id\_interest\_rate1\\",\\"department\_id.location\_id.country\_id.country\_id:department\_id\_location\_id\_country\_id\\",\\"job\_id\\"\],slice:\[{dim:\\"department.applause\_rate \\",interval:null,value:\[26,30,33,65,67,74,75,77\]},{dim:\\"department.location\_id.country\_id.region\_id.region\_name\\",value:\[\\"Asia\\",\\"Europe\\",\\"Americas\\"\]},{dim:\\"vip.charge\_ratio\\",value:\[9,11\]},{dim:\\"credit\_grades.quota2\\",value:\[50000,80000\]},{dim:\\"job\_id\\",value:\[\\"AD\_VP\\",\\"FI\_MGR\\",\\"AC\_MGR\\",\\"SA\_MAN\\",\\"SA\_REP\\"\]},{dim:\\"flag1\\",value:\\"1\\"},{dim:\\"flag8\\",value:\\"1\\"}\]}");//arg\_json

// execute the stored procedure

st.execute();

// get the result set

ResultSet rs = st.getResultSet();

// continue to process the result set and show it out

       }

       catch(Exception e){

              out.println(e);

       }

       finally{

// close the connection

              if (con!=null) {

                     try {con.close();}

                     catch(Exception e) {out.println(e);       }

              }

       }

}

The total execution time of Java code plus SPL is 3 seconds.