3.3 Storing data in account order

 

After sorting data by CustomerID, we can achieve fast distinct operation and grouping operation on the field. During the traversal only the comparison with the previous record is needed, and there is no need to retain the intermediate result set. The computation is fast without the risk of memory overflow.

3.3.1 Stocked data

A
1 =file(“ShipVia.btx”).import@bi()
2 =file(“Orders.txt”).cursor@mt(CustomerID:string, OrderDate:datetime, ProductID:string, Quantity:int, Unit:string, Price:decimal, Amount:decimal, EmployeeID:int,EmployeeName:string,ShipVia:string).run(OrderDate=days@o(OrderDate), ShipVia=A1.pos@b(ShipVia)).sortx(CustomerID,OrderDate;1000000)
3 =file(“Orders_Account.ctx”).create@py(#CustomerID,#OrderDate, ProductID, Quantity, Unit, Price, Amount, EmployeeID,EmployeeName,ShipVia)
4 =A3.append(A2)
5 >A3.close()

A2 Sort data by CustomerID and OrderDate.
A3 Create composite table file structure. @p option enables the composite table to be segmented by units in the first field (here is CustomerID), and means that records having same CustomerID won’t be allocated to different threads during parallel processing – which ensures correct computation. Without the option, the data table will be simply segmented according to the number of records. This could give records having same CustomerID to two threads and result in errors in certain order-based computations. We haven’t used an order-based algorithm as yet and encountered any special segmentation requirements, so the parallel processing can be correctly handled even without @p option.

3.3.2 Newly increased data

Usually, the newly increased data isn’t ordered by CustomerID. They cannot be directly appended to the end of the stocked data that is already ordered. On the other hand, putting the ordered data and the newly increased data together and re-sorting them is time-consuming.

The composite table has a patch table that stores ordered data at a small size. We can only merge the newly increased data with the patch table and sort them all over while maintaining the original composite table unchanged. Later when the patch table grows bigger – to an appropriate size we think, we merge it with the original composite table and sort the whole data. To perform an order-based computation, we retrieve data from the original composite table and from the patch table respectively, and merge them to return. This way the performance is a little lower than that with only one chunk of ordered data, but we can still accomplish the computation fast based on the orderliness.

When to merge the patch table with the original data is related to the cycle time of newly increased data. If new data is generated every day, usually we can perform a big merge of the patch table and the original composite table every month. This way the patch table contains only data of one month at most, and the original composite table stores all data one month ago. Probably the former is far smaller than the latter. The merge per day involves a relatively small amount of data, and data appending is fast. As we just need to perform the big merge once every month, the relatively long process becomes easier to bear.

Generally, the newly increased data is stored in a separate file. Suppose we name it Orders_add.txt:

A B
1 =file(“ShipVia.btx”).import@bi()
2 =file(“Orders_add.txt”).cursor@t(CustomerID:string, OrderDate:datetime, ProductID:string, Quantity:int, Unit:string, Price:decimal, Amount:decimal, EmployeeID:int, EmployeeName:string, ShipVia:string).run(OrderDate=days@o(OrderDate), ShipVia=A1.pos@b(ShipVia)).sortx(CustomerID,OrderDate;1000000)
3 =file(“Orders_Account.ctx”) =A3.open()
4 =B3.append@a(A2)
5 if day(now())==1 >A3.reset()
6 >B3.close()

A2 Retrieve Orders table’s newly increased data file and sort it by CustomerID and OrderDate.
A4 Append the newly added data to the composite table file’s patch table. @a option enables appending data through order-based merge.
A5-B5 Judge whether the current date is the first day of the month. When A5 returns true, use reset() function to merge data in the patch table with the original composite table and reset the patch table.

As @p option is used in creating the composite table file, SPL will handle the appending of newly increased data to the patch table automatically according to the existing segmentation way. This ensures that the subsequent parallel processing is correct.